
Programming for Corpus Linguistics with Python and Dataframes

Published online by Cambridge University Press:  24 May 2024

Daniel Keller
Affiliation:
Western Kentucky University

Summary

This Element offers intermediate and experienced programmers algorithms for corpus linguistic (CL) programming in Python using dataframes, which provide fast, efficient, and intuitive methods for working with large, complex datasets such as corpora. It demonstrates principles of dataframe programming applied to CL analyses and presents complete algorithms for creating concordances; producing lists of collocates, keywords, and lexical bundles; and performing key feature analysis. It also presents an algorithm for creating dataframe corpora, including methods for tokenizing, part-of-speech tagging, and lemmatizing with spaCy. Together these form a set of core skills that can be applied to a range of CL research questions, as well as to original analyses not possible with existing corpus software.
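To make the idea of a dataframe corpus concrete, here is a minimal sketch of a concordance built with pandas. It is not the Element's own algorithm, which is not reproduced on this page. It assumes a corpus stored with one token per row, and the toy texts, the column names (`text_id`, `position`, `token`), and the `concordance` function are all hypothetical choices made for illustration. Whitespace splitting stands in for a real tokenizer such as spaCy.

```python
import pandas as pd

# Toy corpus: two tiny "texts" (hypothetical data, for illustration only).
texts = {
    "t1": "the cat sat on the mat",
    "t2": "a dog chased the cat",
}

# One row per token, with the text ID and the token's position in its text.
# Whitespace splitting is an assumption standing in for a real tokenizer.
rows = [
    {"text_id": text_id, "position": i, "token": tok}
    for text_id, text in texts.items()
    for i, tok in enumerate(text.split())
]
corpus = pd.DataFrame(rows)


def concordance(df, node, window=2):
    """Return KWIC rows for `node` with `window` tokens of context per side."""
    # Grouping by text keeps context from crossing text boundaries.
    grouped = df.groupby("text_id")["token"]

    # shift(k) looks k tokens back; build the left context outermost-first.
    left = pd.Series("", index=df.index)
    for k in range(window, 0, -1):
        left = left + " " + grouped.shift(k).fillna("")

    # shift(-k) looks k tokens ahead; build the right context nearest-first.
    right = pd.Series("", index=df.index)
    for k in range(1, window + 1):
        right = right + " " + grouped.shift(-k).fillna("")

    out = df.assign(left=left.str.strip(), right=right.str.strip())
    hits = out[out["token"] == node]
    return hits[["text_id", "left", "token", "right"]].reset_index(drop=True)


kwic = concordance(corpus, "cat")
print(kwic)
```

Because the context columns are computed with vectorized `shift` operations rather than a Python loop over tokens, the same approach should scale to much larger token tables, though performance on a real corpus is not something this sketch demonstrates.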
Type: Element
Online ISBN: 9781108904094
Publisher: Cambridge University Press
Print publication: 20 June 2024
