Book contents
- Designing and Evaluating Language Corpora
- Copyright page
- Contents
- Figures
- Tables
- Acknowledgments
- 1 Introduction
- 2 Approaches to Representativeness in Previous Corpus Linguistic Research
- 3 Corpus Representativeness
- 4 Domain Considerations
- 5 Distribution Considerations
- 6 The Influence of Domain and Distribution Considerations on Corpus Representativeness
- 7 Corpus Design and Representativeness in Practice – With Daniel Keller
- Glossary
- Book part
- References
- Index
5 - Distribution Considerations
Published online by Cambridge University Press: 07 April 2022
Summary
We define a linguistic distribution as the range of values for a quantitative linguistic variable across the texts in a corpus. An accurate parameter estimate means that the measures based on the corpus are close to the actual values of a parameter in the domain. Precision refers to whether or not the corpus is large enough to reliably capture the distribution of a particular linguistic feature. Distribution considerations relate to the question of how many texts are needed. The answer will vary depending on the nature of the linguistic variable of interest. Linguistic variables can be categorized broadly as linguistic tokens (rates of occurrence for a feature) and linguistic types (the number of different items that occur). The distribution considerations for linguistic tokens and linguistic types are fundamentally different. Corpora can be “undersampled” or “oversampled” – neither of which is desirable. Statistical measures can be used to evaluate corpus size relative to research goals – one set of measures enables researchers to determine the required sample size for a new corpus, while another provides a means to determine precision for an existing corpus. The adage “bigger is better” aptly captures our best recommendation for studies of words and other linguistic types.
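The two kinds of statistical measures mentioned above can be illustrated for linguistic tokens (rates of occurrence) with a minimal sketch. It assumes textbook normal-approximation formulas for a mean, which are not necessarily the measures used in the chapter. The function names, the normalized rates, and the margin of error are hypothetical, and the sketch does not apply to linguistic types.

```python
import math
from statistics import NormalDist, mean, stdev


def required_texts(sd, margin, confidence=0.95):
    """Estimate how many texts a NEW corpus needs so that the mean rate of a
    feature falls within +/- `margin` of the domain value.

    Assumption: normal approximation, n = (z * sd / margin) ** 2, where `sd`
    is a pilot estimate of the standard deviation of per-text rates.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sd / margin) ** 2)


def precision(rates, confidence=0.95):
    """Describe the precision of an EXISTING corpus for one feature.

    `rates` holds one normalized rate per text (e.g. per 1,000 words).
    Returns the mean, its standard error, and the half-width of an
    approximate confidence interval. For small samples a t-based interval
    would be wider than this normal-based one.
    """
    n = len(rates)
    se = stdev(rates) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean(rates), se, z * se


if __name__ == "__main__":
    # Hypothetical pilot values: sd of 10 per 1,000 words, target margin of 2.
    print(required_texts(sd=10, margin=2))

    # Hypothetical per-text rates from a small existing corpus.
    m, se, half_width = precision([10, 12, 14, 16, 18])
    print(f"mean={m:.2f} se={se:.2f} interval=+/-{half_width:.2f}")
```

The first function addresses the planning question of how many texts are needed, while the second reports how precisely an existing corpus estimates a rate. A wide interval would suggest an undersampled corpus for that feature.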
- Type: Chapter
- Information: Designing and Evaluating Language Corpora: A Practical Framework for Corpus Representativeness, pp. 122-155. Publisher: Cambridge University Press. Print publication year: 2022.