Book contents
- Designing and Evaluating Language Corpora
- Copyright page
- Contents
- Figures
- Tables
- Acknowledgments
- 1 Introduction
- 2 Approaches to Representativeness in Previous Corpus Linguistic Research
- 3 Corpus Representativeness
- 4 Domain Considerations
- 5 Distribution Considerations
- 6 The Influence of Domain and Distribution Considerations on Corpus Representativeness
- 7 Corpus Design and Representativeness in Practice – With Daniel Keller
- Glossary
- Book part
- References
- Index
2 - Approaches to Representativeness in Previous Corpus Linguistic Research
Published online by Cambridge University Press: 07 April 2022
Summary
We demonstrate that there is little consensus on what representativeness is, either in statistics or in corpus linguistics (CL). “Representative” is a general term that must be made specific within a particular context before it can be used to evaluate a sample. We introduce ten attested conceptualizations of corpus representativeness: (1) representativeness as “general acclaim for data”; (2) a representative corpus has been collected with the “absence of selective focus”; (3) a representative corpus contains texts that are “typical or ideal cases” of the target domain; (4) a representative corpus is a “miniature of the population”; (5) a representative corpus achieves “coverage of the population’s heterogeneity”; (6) a representative corpus “permits good estimation”; (7) a representative corpus is a corpus that is “good enough for a particular purpose”; (8) a large corpus is more important than a representative corpus; (9) a representative corpus is a “balanced” corpus; (10) a representative corpus is never possible. The term “balance” has no single agreed-upon definition in CL and is in fact often defined in contradictory ways. A unified and operational definition of corpus representativeness is needed.
- Designing and Evaluating Language Corpora: A Practical Framework for Corpus Representativeness, pp. 28-51. Publisher: Cambridge University Press. Print publication year: 2022.