Book contents
- Designing and Evaluating Language Corpora
- Copyright page
- Contents
- Figures
- Tables
- Acknowledgments
- 1 Introduction
- 2 Approaches to Representativeness in Previous Corpus Linguistic Research
- 3 Corpus Representativeness
- 4 Domain Considerations
- 5 Distribution Considerations
- 6 The Influence of Domain and Distribution Considerations on Corpus Representativeness
- 7 Corpus Design and Representativeness in Practice – With Daniel Keller
- Glossary
- References
- Index
4 - Domain Considerations
Published online by Cambridge University Press: 07 April 2022
Summary
We show that attending to domain considerations in corpus design involves three steps: (1) describing the domain as fully as possible; (2) operationalizing the domain; and (3) sampling the texts. Describing the domain requires defining its boundaries: which texts belong within the domain and which do not? It also requires identifying important internal categories of texts that reflect qualitative variation within the domain. Domain description should be carried out systematically, using a range of sources that can be evaluated for quality and triangulated. Operationalizing the domain means specifying the set of texts that are available for sampling; operational domains are always precisely bounded and specified. A sampling frame is an itemized list of all texts (from the operational domain) that are available for sampling. A sampling unit is the individual “object” (usually a text) that will be included in the corpus. Stratification is the process of collecting texts according to identified categories within the domain, and it is usually desirable in corpus design. Proportionality refers to the relative sizes of strata within the sample: strata can be proportional or equal-sized. Sampling methods can be broadly categorized as random and nonrandom.
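The sampling terms in the summary can be illustrated with a minimal Python sketch. It is not the chapter's own procedure: the function name `stratified_sample`, the toy frame of news, blog, and speech texts, and the simple rounding rule for quotas are all assumptions made for illustration. The sampling frame is modeled as a list of (text ID, stratum) pairs, and texts are drawn at random within each stratum, with strata either proportional to their share of the frame or equal-sized.

```python
import random
from collections import defaultdict


def stratified_sample(frame, total, proportional=True, seed=0):
    """Draw a stratified random sample from a sampling frame.

    frame: list of (text_id, stratum) pairs, one per available text.
    total: target number of texts in the corpus.
    proportional: if True, each stratum's quota mirrors its share of the
        frame; if False, every stratum gets the same quota.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable

    # Group the sampling units (texts) by stratum.
    strata = defaultdict(list)
    for text_id, stratum in frame:
        strata[stratum].append(text_id)

    sample = {}
    for name, texts in strata.items():
        if proportional:
            quota = round(total * len(texts) / len(frame))
        else:
            quota = total // len(strata)
        # A stratum cannot contribute more texts than it contains.
        quota = min(quota, len(texts))
        sample[name] = rng.sample(texts, quota)
    return sample


# Hypothetical sampling frame: 60 news texts, 30 blogs, 10 speeches.
frame = (
    [(f"news_{i}", "news") for i in range(60)]
    + [(f"blog_{i}", "blog") for i in range(30)]
    + [(f"speech_{i}", "speech") for i in range(10)]
)

proportional = stratified_sample(frame, total=20, proportional=True)
equal = stratified_sample(frame, total=18, proportional=False)

print({k: len(v) for k, v in proportional.items()})
# {'news': 12, 'blog': 6, 'speech': 2}
print({k: len(v) for k, v in equal.items()})
# {'news': 6, 'blog': 6, 'speech': 6}
```

With other frame sizes, rounding the proportional quotas can make the sample slightly larger or smaller than `total`; a real design would need an explicit rule for distributing the remainder.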
- Type: Chapter
- Information: Designing and Evaluating Language Corpora: A Practical Framework for Corpus Representativeness, pp. 68-121
- Publisher: Cambridge University Press
- Print publication year: 2022