Exploiting native language interference for native language identification

Ilia Markov; Vivi Nastase; Carlo Strapparava

doi:10.1017/S1351324920000595

Published online by Cambridge University Press: 26 November 2020

Affiliations:
Ilia Markov*: University of Antwerp, CLiPS, Antwerp, Belgium
Vivi Nastase: University of Stuttgart, Stuttgart, Germany
Carlo Strapparava: FBK-irst, Fondazione Bruno Kessler, Trento, Italy
*Corresponding author. E-mail: [email protected]

Abstract

Native language identification (NLI)—the task of automatically identifying the native language (L1) of persons based on their writings in the second language (L2)—is based on the hypothesis that characteristics of L1 will surface and interfere in the production of texts in L2 to the extent that L1 is identifiable. We present an in-depth investigation of features that model a variety of linguistic phenomena potentially involved in native language interference in the context of the NLI task: the languages’ structuring of information through punctuation usage, emotion expression in language, and similarities of form with the L1 vocabulary through the use of anglicized words, cognates, and other misspellings. The results of experiments with different combinations of features in a variety of settings allow us to quantify the native language interference value of these linguistic phenomena and show how robust they are in cross-corpus experiments and with respect to proficiency in L2. These experiments provide a deeper insight into the NLI task, showing how native language interference explains the gap between baseline, corpus-independent features, and the state of the art that relies on features/representations that cover (indiscriminately) a variety of linguistic phenomena.
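As a rough illustration of two of the feature families named above (punctuation usage and form similarity between misspellings and L1 vocabulary), the following Python sketch may help. It is not the authors' implementation: the function names `punctuation_profile` and `edit_distance`, the restriction to ASCII punctuation, and the example word pair are all assumptions made for this example. The edit distance is the standard Levenshtein measure.

```python
from collections import Counter
import string


def punctuation_profile(text: str) -> dict:
    """Relative frequency of each ASCII punctuation mark found in `text`.

    Hypothetical helper: a crude stand-in for punctuation-usage features.
    """
    marks = [ch for ch in text if ch in string.punctuation]
    total = len(marks)
    if total == 0:
        return {}
    return {mark: n / total for mark, n in Counter(marks).items()}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions
    needed to turn `a` into `b` (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


# Assumed example: the misspelling "exemple" is closer to the French word
# "exemple" than to the intended English word "example".
misspelling = "exemple"
print(edit_distance(misspelling, "example"))  # distance to the English target
print(edit_distance(misspelling, "exemple"))  # distance to the French cognate
print(punctuation_profile("Well, yes; but, no."))
```

In a sketch like this, a misspelling that lies closer to an L1 word than to its intended L2 target is the kind of signal that could hint at the writer's native language; how such signals are actually weighted and combined is left to the article itself.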

Keywords

Native language interference; native language identification; punctuation; emotions; cognates

Type: Article
Information: Natural Language Engineering, Volume 28, Issue 2, March 2022, pp. 167–197

DOI: https://doi.org/10.1017/S1351324920000595
Copyright: © The Author(s), 2020. Published by Cambridge University Press

