Comparing Standard Reference Corpora and Google Books Ngrams

doi:10.1017/9781108589314.002

1 - Comparing Standard Reference Corpora and Google Books Ngrams

Strengths, Limitations and Synergies in the Contrastive Study of Variable h- in British and American English

from Part I - Corpus Dimensions and the Viability of Methodological Approaches

Published online by Cambridge University Press: 06 May 2022

Lukas Sönning and Julia Schlüter

Edited by Ole Schützler and Julia Schlüter


Ole Schützler: Universität Leipzig
Julia Schlüter: Universität Bamberg


Summary

This chapter compares two standard reference corpora, the British National Corpus and the Corpus of Contemporary American English, with the multi-billion-word database of Google Books Ngrams, which, despite its allure, has so far been used in few systematic linguistic studies. The case study focuses on indefinite article allomorphy (a vs an) as an orthographic cue to the phonological strength of ‹h›-onsets in British and American English. As expected, the size advantage of the Ngrams database plays out in larger type and token counts, more stable estimates and fewer distortions due to data sparsity. However, because its metadata are extremely limited (to year and variety), a fully accountable analysis is not feasible. The case study illustrates how richly annotated corpora can shed light on potential disturbances arising from two sources: genre differences and between-author variability. A sensitivity analysis offers some degree of reassurance when extending the analysis to the Ngrams database. In this way, the authors demonstrate that the strengths and limitations of corpora and big data resources can, with due caution, be counterbalanced to answer questions of linguistic interest.

Keywords

Google Books Ngrams; big data; metadata; type frequency; token frequency; hierarchical data structure; corpus comparability; data quality

Type: Chapter
Information: Data and Methods in Corpus Linguistics: Comparative Approaches, pp. 17–45

DOI: https://doi.org/10.1017/9781108589314.002

Publisher: Cambridge University Press

Print publication year: 2022


