Book contents
- English Corpus Linguistics
- Copyright page
- Contents
- Figures
- Tables
- Preface
- Acknowledgments
- 1 The Empirical Study of Language
- 2 Planning the Construction of a Corpus
- 3 Building and Annotating a Corpus
- 4 Analyzing a Corpus
- Concluding Remarks
- Discussion Topics
- Appendix: Corpora
- Bibliography
- Index
3 - Building and Annotating a Corpus
Published online by Cambridge University Press: 15 June 2023
Summary
This chapter describes the process of creating and annotating a corpus. This process involves, for instance, collecting data (speech and writing), transcribing recorded speech, and adding annotation: markup indicating, for example, where in a conversation one person’s speech overlaps another’s. Written texts are relatively easy to collect, since most writing is readily available in digital formats. Speech, especially spontaneous conversation, has to be transcribed, though voice recognition software has made progress in automating the transcription of certain kinds of speech, such as monologues. Other stages of building a corpus are also discussed, ranging from the administrative (keeping records of texts collected) to transcribing recordings of speech. The chapter concludes with a description of various kinds of textual markup and linguistic annotation that can be added to texts. Topics discussed include how to create a “header” for a particular text, which records various kinds of information about it. For a written text, the header would include, for instance, the title of the text, the author(s), and, if the text was published, where it was published. Other textual markup is internal to the text, and in a spoken text would include such information as speaker IDs and the beginnings and ends of overlapping speech.
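As a rough illustration of the two kinds of markup the summary mentions, the sketch below builds a header for a written text and marks a stretch of overlapping speech inside a speaker turn. The tag names (`header`, `title`, `author`, `published`, `turn`, `overlap`) and the helper names are hypothetical, chosen only for this example; they are not the chapter's scheme or any published standard.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
from xml.sax.saxutils import escape, quoteattr


@dataclass
class WrittenTextHeader:
    """Hypothetical header record for one written text."""
    title: str
    authors: List[str]
    published_in: Optional[str] = None  # None when the text is unpublished

    def to_markup(self) -> str:
        # Emit one element per piece of header information.
        lines = ["<header>", f"  <title>{escape(self.title)}</title>"]
        for name in self.authors:
            lines.append(f"  <author>{escape(name)}</author>")
        if self.published_in is not None:
            lines.append(f"  <published>{escape(self.published_in)}</published>")
        lines.append("</header>")
        return "\n".join(lines)


def mark_turn(speaker_id: str, words: List[str],
              overlap: Optional[Tuple[int, int]] = None) -> str:
    """Wrap one turn in a speaker ID; `overlap` is a (start, end) range of
    word indices (end exclusive) spoken while another speaker was talking."""
    escaped = [escape(w) for w in words]
    if overlap is None:
        body = " ".join(escaped)
    else:
        start, end = overlap
        before = " ".join(escaped[:start])
        inside = " ".join(escaped[start:end])
        after = " ".join(escaped[end:])
        # Drop empty segments so no stray spaces surround the overlap tags.
        segments = [s for s in (before, f"<overlap>{inside}</overlap>", after) if s]
        body = " ".join(segments)
    return f"<turn speaker={quoteattr(speaker_id)}>{body}</turn>"


if __name__ == "__main__":
    print(WrittenTextHeader("A Sample Text", ["A. Writer"], "A Sample Journal").to_markup())
    print(mark_turn("A", ["so", "I", "went", "home"], overlap=(2, 4)))
```

Keeping the header separate from the text-internal markup mirrors the distinction drawn in the summary: the header describes the text as a whole, while speaker IDs and overlap boundaries sit inside the transcription itself.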
Type: Chapter
Information: English Corpus Linguistics: An Introduction, pp. 77-117
Publisher: Cambridge University Press
Print publication year: 2023