7 - Native Language Identification on EFCAMDAT

from Part III - Data Driven Models

Published online by Cambridge University Press:  30 November 2017

Xiao Jiang, Computer Laboratory, University of Cambridge, UK
Yan Huang, Department of Theoretical and Applied Linguistics, University of Cambridge, UK
Yufan Guo, IBM Research, USA
Jeroen Geertzen, Department of Theoretical and Applied Linguistics, University of Cambridge, UK
Theodora Alexopoulou, Department of Theoretical and Applied Linguistics, University of Cambridge, UK
Lin Sun, Greedy Intelligence, China
Anna Korhonen, Department of Theoretical and Applied Linguistics, University of Cambridge, UK
Thierry Poibeau (editor), Centre National de la Recherche Scientifique (CNRS), Paris
Aline Villavicencio (editor), Universidade Federal do Rio Grande do Sul, Brazil

Abstract

Native Language Identification (NLI) is a task aimed at determining the native language (L1) of learners of a second language (L2) on the basis of their written texts. To date, research on NLI has focused on relatively small corpora. We apply NLI to EFCAMDAT, an L2 English learner corpus that is not only several times larger than previous L2 corpora but also provides pseudo-longitudinal data across several proficiency levels. Based on accurate machine learning with a wide range of linguistic features, our investigation reveals interesting patterns in the longitudinal data that are useful for both further development of NLI and its application to research on L2 acquisition.

Introduction

Native language identification (NLI) is a task aimed at detecting the native language (L1) of writers on the basis of their second language (L2) production. NLI is important for natural language processing (NLP) applications including language tutoring systems and authorship profiling. Moreover, NLI can offer useful empirical data for research on L2 acquisition. For example, NLI can shed light on how L1 background influences L2 learning, and on differences between the writings of L2 learners across different L1 backgrounds.

To date, studies on NLI have focused on relatively small learner corpora. Furthermore, none of them have investigated the influence of L1s across L2 proficiency levels. Our work takes the first step toward addressing these problems. We apply NLI to EFCAMDAT, the EF-Cambridge Open Language Database (Geertzen, Alexopoulou, and Korhonen, 2013), an open-access L2 learner corpus.

EFCAMDAT consists of writings that learners submitted to Englishtown, the online school of EF. EFCAMDAT stands out for its size, the diversity of student backgrounds, and its coverage of proficiency levels. The first release of 2013 (Geertzen, Alexopoulou, and Korhonen, 2013), on which this paper is based, amounts to 30 million words, making it several times larger than any other available L2 corpus. Using a standard machine learning–based methodology for NLI, we explore the optimal linguistic features for NLI on this data at different proficiency levels. We discover interesting patterns that can be useful for both further development of NLI and its application to research on L2 acquisition.
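To make the framing of NLI as supervised text classification concrete, the following is a minimal sketch only, not the pipeline used in this chapter. It assumes character trigram counts as features and a multinomial naive Bayes classifier, both chosen here purely for brevity; the class name `NaiveBayesNLI`, the helper `char_ngrams`, the labels `L1_A` and `L1_B`, and the toy sentences are all hypothetical and are not drawn from EFCAMDAT.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


class NaiveBayesNLI:
    """Toy multinomial naive Bayes over character n-gram counts.

    An illustrative baseline (assumption), not the chapter's method.
    """

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha  # add-alpha smoothing constant

    def fit(self, texts, labels):
        # Number of training documents per L1 label (for the class prior).
        self.doc_counts = Counter(labels)
        # Pooled n-gram counts per L1 label.
        self.ngram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text, self.n))
        # Vocabulary of all n-grams seen in training, used for smoothing.
        self.vocab = set()
        for counts in self.ngram_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        feats = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram, freq in feats.items():
                score += freq * math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data: two hypothetical L1 groups with made-up learner sentences.
toy_texts = [
    "I am agree with this opinion about the school",
    "He have many informations about the travel",
    "Yesterday I go to the park with my friend",
    "I very like play the basketball on weekend",
]
toy_labels = ["L1_A", "L1_A", "L1_B", "L1_B"]

model = NaiveBayesNLI(n=3).fit(toy_texts, toy_labels)
print(model.predict("I am agree that the travel is good"))
```

In practice, a system of this kind would be trained on many scripts per L1 and evaluated on held-out data; the feature set and classifier are the parts one would vary when comparing linguistic features across proficiency levels.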

In this introductory section, we first review the history of research on NLI and introduce the data sets that have been used in earlier NLI research. We then briefly summarise our contribution.

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2018

References

Ahn, Charles S. (2011). “Automatically detecting authors’ native language”. Ph.D. thesis, Monterey, California. Naval Postgraduate School.
Al-Rfou, Rami (2012). “Detecting English Writing Styles For Non-native Speakers”. In: arXiv preprint arXiv:1211.0498.
Bestgen, Yves, Sylviane Granger, Jennifer Thewissen, et al. (2012). “Error patterns and automatic L1 identification”. In: Approaching language transfer through text classification, pp. 127–153.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, pp. 338–339.
Blanchard, Daniel, et al. (2013). “TOEFL11: A corpus of non-native English”. In: Educational Testing Service.
Brooke, Julian, and Graeme Hirst (2011). “Native language detection with cheap learner corpora”. In: Conference of Learner Corpus Research (LCR2011).
Bykh, Serhiy, and Detmar Meurers (2012). “Native Language Identification Using Recurring N-grams–Investigating Abstraction and Domain Dependence”. In: Proceedings of COLING 2012: Technical Papers, pp. 425–440.
Charniak, Eugene, and Mark Johnson (2005). “Coarse-to-fine n-best parsing and MaxEnt discriminative reranking”. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, pp. 173–180.
De Marneffe, Marie-Catherine, Bill MacCartney, and Christopher D Manning (2006). “Generating typed dependency parses from phrase structure parses”. In: Proceedings of LREC, Vol. 6, pp. 449–454.
De Marneffe, Marie-Catherine, and Christopher D Manning (2008). “Stanford typed dependencies manual”. URL: http://nlp.stanford.edu/software/dependenciesmanual.pdf.
Estival, Dominique et al. (2007). “Author profiling for English emails”. In: Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING07), pp. 263–272.
Fan, Rong-En et al. (2008). “LIBLINEAR: A library for large linear classification”. In: The Journal of Machine Learning Research 9, pp. 1871–1874.
Geertzen, Jeroen, Theodora Alexopoulou, and Anna Korhonen (2013). “Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT)”. In: Proceedings of the 31st Second Language Research Forum (SLRF), Carnegie Mellon. Cascadilla Proceedings Project.
Graesser, Arthur C et al. (2004). “Coh-Metrix: Analysis of text on cohesion and language”. In: Behavior Research Methods, Instruments, & Computers 36.2, pp. 193–202.
Granger, Sylviane (2003). “The international corpus of learner English: a new resource for foreign language learning and teaching and second language acquisition research”. In: TESOL Quarterly 37.3, pp. 538–546.
Ionescu, Radu Tudor, Marius Popescu, and Aoife Cahill (2014). “Can characters reveal your native language? A language-independent approach to native language identification”. In: Proceedings of EMNLP, October.
Jarvis, Scott (2011). “Data mining with learner corpora”. In: A Taste for Corpora: In Honour of Sylviane Granger 45.
Jarvis, Scott, Yves Bestgen, and Steve Pepper (2013). “Maximizing Classification Accuracy in Native Language Identification”. In: NAACL/HLT 2013, p. 111.
Jarvis, Scott, and Scott A Crossley (2012). Approaching Language Transfer through Text Classification: Explorations in the Detection-based Approach. Multilingual Matters.
Joachims, Thorsten (1998). Text categorization with support vector machines: Learning with many relevant features. Springer.
King, Tracy Holloway (1995). Configuring Topic and Focus in Russian. CSLI Publications.
Klein, Dan, and Christopher D Manning (2003). “Accurate unlexicalized parsing”. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1. Association for Computational Linguistics, pp. 423–430.
Kochmar, Ekaterina (2011). “Identification of a writer's native language by error analysis”. Master's thesis, University of Cambridge.
Koppel, Moshe, Jonathan Schler, and Kfir Zigdon (2005). “Automatically determining an anonymous author's native language”. In: Intelligence and Security Informatics. Springer, pp. 209–217.
Lardiere, Donna (1998). “Case and tense in the ‘fossilized’ steady state”. In: Second Language Research 14, pp. 1–26.
Lavergne, Thomas, et al. (2013). “LIMSI's Participation in the 2013 Shared Task on Native Language Identification”. In: NAACL/HLT 2013, p. 260.
Malmasi, Shervin, and Mark Dras (2014a). “Arabic Native Language Identification”. In: ANLP 2014, p. 180.
Malmasi, Shervin, and Mark Dras (2014b). “Chinese Native Language Identification”. In: EACL 2014, p. 95.
Marcus, Mitchell, Ann Taylor, and Robert MacIntyre (2012). Alphabetical list of part-of-speech tags used in the Penn Treebank Project. URL: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html (visited on 05/30/2013).
Popescu, Marius, and Radu Tudor Ionescu (2013). “The Story of the Characters, the DNA and the Native Language”. In: NAACL/HLT 2013, p. 270.
Swanson, Ben, and Eugene Charniak (2012). “Native language detection with tree substitution grammars”. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2. Association for Computational Linguistics, pp. 193–197.
Tetreault, Joel, Daniel Blanchard, and Aoife Cahill (2013). “A report on the first native language identification shared task”. In: NAACL/HLT 2013, p. 48.
Tetreault, Joel R et al. (2012). “Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification”. In: COLING, pp. 2585–2602.
Tomokiyo, Laura Mayfield, and Rosie Jones (2001). “You're not from 'round here, are you?: naive Bayes detection of non-native utterance text”. In: Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies. Association for Computational Linguistics, pp. 1–8.
Tsur, Oren, and Ari Rappoport (2007). “Using classifier features for studying the effect of native language on the choice of written second language words”. In: Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition. Association for Computational Linguistics, pp. 9–16.
Tsvetkov, Yulia et al. (2013). “Identifying the L1 of non-native writers: the CMU-Haifa system”. In: NAACL/HLT 2013, p. 279.
Vapnik, V. N. (1998). Statistical learning theory. Adaptive and learning systems for signal processing, communications, and control. Wiley, pp. 437–438.
Wong, Sze-Meng Jojo, and Mark Dras (2009). “Contrastive analysis and native language identification”. In: Proceedings of the Australasian Language Technology Association Workshop. Citeseer, pp. 53–61.
Wong, Sze-Meng Jojo (2011). “Exploiting parse structures for native language identification”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 1600–1610.
Wong, Sze-Meng Jojo, Mark Dras, and Mark Johnson (2011). “Topic modeling for native language identification”. In: Proceedings of the Australasian Language Technology Association Workshop, pp. 115–124.
Wong, Sze-Meng Jojo (2012). “Exploring adaptor grammars for native language identification”. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, pp. 699–709.
Yang, Yiming, and Jan O Pedersen (1997). “A comparative study on feature selection in text categorization”. In: International Conference on Machine Learning. Morgan Kaufmann Publishers, Inc., pp. 412–420.

  • Native Language Identification on EFCAMDAT
    • By Xiao Jiang, Computer Laboratory, University of Cambridge, UK, Yan Huang, Department of Theoretical and Applied Linguistics, University of Cambridge, UK, Yufan Guo, IBM Research, USA, Jeroen Geertzen, Department of Theoretical and Applied Linguistics, University of Cambridge, UK, Theodora Alexopoulou, Department of Theoretical and Applied Linguistics, University of Cambridge, UK, Lin Sun, Greedy Intelligence, China, Anna Korhonen, Department of Theoretical and Applied Linguistics, University of Cambridge, UK
  • Edited by Thierry Poibeau, Centre National de la Recherche Scientifique (CNRS), Paris, Aline Villavicencio, Universidade Federal do Rio Grande do Sul, Brazil
  • Book: Language, Cognition, and Computational Models
  • Online publication: 30 November 2017
  • Chapter DOI: https://doi.org/10.1017/9781316676974.007