Native Language Identification on EFCAMDAT

doi:10.1017/9781316676974.007

7 - Native Language Identification on EFCAMDAT

from Part III - Data Driven Models

Published online by Cambridge University Press: 30 November 2017

Xiao Jiang, Yan Huang, Yufan Guo, Jeroen Geertzen, Theodora Alexopoulou, Lin Sun and Anna Korhonen

Edited by Thierry Poibeau and Aline Villavicencio

Xiao Jiang: Affiliation:
Computer Laboratory, University of Cambridge, UK
Yan Huang: Affiliation:
Department of Theoretical and Applied Linguistics, University of Cambridge, UK
Yufan Guo: Affiliation:
IBM Research, USA
Jeroen Geertzen: Affiliation:
Department of Theoretical and Applied Linguistics, University of Cambridge, UK
Theodora Alexopoulou: Affiliation:
Department of Theoretical and Applied Linguistics, University of Cambridge, UK
Lin Sun: Affiliation:
Greedy Intelligence, China
Anna Korhonen: Affiliation:
Department of Theoretical and Applied Linguistics, University of Cambridge, UK
Thierry Poibeau: Affiliation:
Centre National de la Recherche Scientifique (CNRS), Paris
Aline Villavicencio: Affiliation:
Universidade Federal do Rio Grande do Sul, Brazil

Summary

Abstract

Native Language Identification (NLI) is a task aimed at determining the native language (L1) of learners of a second language (L2) on the basis of their written texts. To date, research on NLI has focused on relatively small corpora. We apply NLI to EFCAMDAT, an L2 English learner corpus that is not only several times larger than previous L2 corpora but also provides pseudo-longitudinal data across several proficiency levels. Based on accurate machine learning with a wide range of linguistic features, our investigation reveals interesting patterns in the longitudinal data that are useful for both further development of NLI and its application to research on L2 acquisition.

Introduction

Native language identification (NLI) is a task aimed at detecting the native language (L1) of writers on the basis of their second language (L2) production. NLI is important for natural language processing (NLP) applications including language tutoring systems and authorship profiling. Moreover, NLI can offer useful empirical data for research on L2 acquisition. For example, NLI can shed light on how L1 background influences L2 learning, and on differences between the writings of L2 learners across different L1 backgrounds.

To date, studies on NLI have focused on relatively small learner corpora. Furthermore, none of them have investigated the influence of L1s across L2 proficiency levels. Our work takes the first step toward addressing these problems. We apply NLI to EFCAMDAT, the EF-Cambridge Open Language Database (Geertzen, Alexopoulou, and Korhonen, 2013), an open-access L2 learner corpus.

EFCAMDAT consists of writings of learners submitted to Englishtown, the online school of EF. EFCAMDAT stands out for its size, diversity of student backgrounds, and coverage of the proficiency levels. The first release of 2013 (Geertzen, Alexopoulou, and Korhonen, 2013), on which this paper is based, amounts to 30 million words, making it several times larger than any other available L2 corpus. Using a standard machine learning–based methodology for NLI, we explore the optimal linguistic features for NLI on this data at different proficiency levels. We discover interesting patterns that can be useful for both further development of NLI and its application to research on L2 acquisition.

In this introductory section, we first review the history of research on NLI and introduce the data sets that have been used in earlier NLI research. We then summarise our contribution briefly.

Type: Chapter
Information: Language, Cognition, and Computational Models, pp. 159–184

DOI: https://doi.org/10.1017/9781316676974.007

Publisher: Cambridge University Press

Print publication year: 2018

References

Ahn, Charles S. (2011). “Automatically detecting authors’ native language”. Ph.D. thesis, Monterey, California. Naval Postgraduate School.

Al-Rfou, Rami. (2012). “Detecting English Writing Styles For Non-native Speakers”. In: arXiv preprint arXiv:1211.0498.

Bestgen, Yves, Sylviane Granger, Jennifer Thewissen, et al. (2012). “Error patterns and automatic L1 identification”. In: Approaching language transfer through text classification, pp. 127–153.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, pp. 338–339.

Blanchard, Daniel, et al. (2013). “TOEFL11: A corpus of non-native English”. In: Educational Testing Service.

Brooke, Julian, and Graeme Hirst (2011). “Native language detection with cheap learner corpora”. In: Conference of Learner Corpus Research (LCR2011).

Bykh, Serhiy, and Detmar Meurers (2012). “Native Language Identification Using Recurring N-grams–Investigating Abstraction and Domain Dependence”. In: Proceedings of COLING 2012: Technical Papers, pp. 425–440.

Charniak, Eugene, and Mark Johnson (2005). “Coarse-to-fine n-best parsing and Max-Ent discriminative reranking”. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, pp. 173–180.

De Marneffe, Marie-Catherine, Bill MacCartney, and Christopher D. Manning (2006). “Generating typed dependency parses from phrase structure parses”. In: Proceedings of LREC, Vol. 6, pp. 449–454.

De Marneffe, Marie-Catherine, and Christopher D. Manning (2008). “Stanford typed dependencies manual”. URL: http://nlp.stanford.edu/software/dependenciesmanual.pdf.

Estival, Dominique et al. (2007). “Author profiling for English emails”. In: Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING07), pp. 263–272.

Fan, Rong-En et al. (2008). “LIBLINEAR: A library for large linear classification”. In: The Journal of Machine Learning Research 9, pp. 1871–1874.

Geertzen, Jeroen, Theodora Alexopoulou, and Anna Korhonen (2013). “Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT)”. In: Proceedings of the 31st Second Language Research Forum (SLRF), Carnegie Mellon. Cascadilla Proceedings Project.

Graesser, Arthur C et al. (2004). “Coh-Metrix: Analysis of text on cohesion and language”. In: Behavior Research Methods, Instruments, & Computers 36.2, pp. 193–202.

Granger, Sylviane. (2003). “The international corpus of learner English: a new resource for foreign language learning and teaching and second language acquisition research”. In: TESOL Quarterly 37.3, pp. 538–546.

Ionescu, Radu Tudor, Marius Popescu, and Aoife Cahill (2014). “Can characters reveal your native language? A language-independent approach to native language identification”. In: Proceedings of EMNLP, October.

Jarvis, Scott (2011). “Data mining with learner corpora”. In: A Taste for Corpora: In Honour of Sylviane Granger 45.

Jarvis, Scott, Yves Bestgen, and Steve Pepper (2013). “Maximizing Classification Accuracy in Native Language Identification”. In: NAACL/HLT 2013, p. 111.

Jarvis, Scott, and Scott A. Crossley (2012). Approaching Language Transfer through Text Classification: Explorations in the Detection-based Approach. Multilingual Matters.

Joachims, Thorsten (1998). Text categorization with support vector machines: Learning with many relevant features. Springer.

King, Tracy Halloway. (1995). Configuring Topic and Focus in Russian. CSLI Publications.

Klein, Dan, and Christopher D. Manning (2003). “Accurate unlexicalized parsing”. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1. Association for Computational Linguistics, pp. 423–430.

Kochmar, Ekaterina (2011). “Identification of a writer's native language by error analysis”. Master's thesis, University of Cambridge.

Koppel, Moshe, Jonathan Schler, and Kfir Zigdon (2005). “Automatically determining an anonymous author's native language”. In: Intelligence and Security Informatics. Springer, pp. 209–217.

Lardiere, Donna (1998). “Case and tense in the ‘fossilized’ steady state”. In: Second Language Research, 14, pp. 1–26.

Lavergne, Thomas, et al. (2013). “LIMSI's Participation in the 2013 Shared Task on Native Language Identification”. In: NAACL/HLT 2013, p. 260.

Malmasi, Shervin, and Mark Dras (2014a). “Arabic Native Language Identification”. In: ANLP 2014, p. 180.

Malmasi, Shervin, and Mark Dras (2014b). “Chinese Native Language Identification”. In: EACL 2014, p. 95.

Marcus, Mitchell, Ann Taylor, and Robert MacIntyre (2012). Alphabetical list of part-of-speech tags used in the Penn Treebank Project. URL: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html (visited on 05/30/2013).

Popescu, Marius, and Radu Tudor Ionescu (2013). “The Story of the Characters, the DNA and the Native Language”. In: NAACL/HLT 2013, p. 270.

Swanson, Ben, and Eugene Charniak (2012). “Native language detection with tree substitution grammars”. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2. Association for Computational Linguistics, pp. 193–197.

Tetreault, Joel, Daniel Blanchard, and Aoife Cahill (2013). “A report on the first native language identification shared task”. In: NAACL/HLT 2013, p. 48.

Tetreault, Joel R et al. (2012). “Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification”. In: COLING, pp. 2585–2602.

Tomokiyo, Laura Mayfield, and Rosie Jones (2001). “You're not from 'round here, are you?: naive Bayes detection of non-native utterance text”. In: Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies. Association for Computational Linguistics, pp. 1–8.

Tsur, Oren, and Ari Rappoport (2007). “Using classifier features for studying the effect of native language on the choice of written second language words”. In: Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition. Association for Computational Linguistics, pp. 9–16.

Tsvetkov, Yulia et al. (2013). “Identifying the L1 of non-native writers: the CMU-Haifa system”. In: NAACL/HLT 2013, p. 279.

Vapnik, V. N. (1998). Statistical learning theory. Adaptive and learning systems for signal processing, communications, and control. Wiley, pp. 437–438.

Wong, Sze-Meng Jojo, and Mark Dras (2009). “Contrastive analysis and native language identification”. In: Proceedings of the Australasian Language Technology Association Workshop. Citeseer, pp. 53–61.

Wong, Sze-Meng Jojo (2011). “Exploiting parse structures for native language identification”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 1600–1610.

Wong, Sze-Meng Jojo, Mark Dras, and Mark Johnson (2011). “Topic modeling for native language identification”. In: Proceedings of the Australasian Language Technology Association Workshop, pp. 115–124.

Wong, Sze-Meng Jojo (2012). “Exploring adaptor grammars for native language identification”. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, pp. 699–709.

Yang, Yiming, and Jan O. Pedersen (1997). “A comparative study on feature selection in text categorization”. In: International Conference on Machine Learning. Morgan Kaufmann Publishers, Inc., pp. 412–420.