SwitchNet: Learning to switch for word-level language identification in code-mixed social media text

Published online by Cambridge University Press: 03 June 2021

Neelakshi Sarma*
Affiliation:
Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Guwahati, 781039, India
Ranbir Sanasam Singh
Affiliation:
Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Guwahati, 781039, India
Diganta Goswami
Affiliation:
Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Guwahati, 781039, India
*Corresponding author. E-mail: [email protected]

Abstract

Word-level language identification is an essential prerequisite for extracting useful information from code-mixed social media content. Previous studies of word-level language identification report two important observations. First, when a word is valid in multiple languages, its local context is an important indicator of its language. Second, when a word is borrowed or embedded into sentences of another language, considering the word in isolation from its context leads to more effective language classification. In this paper, we propose a framework for language identification that uses a dynamic switching mechanism to classify effectively both words that are borrowed or embedded from other languages and words that are valid in multiple languages. For a given input, the switching mechanism decides dynamically whether to bias its prediction towards the prediction obtained from contextual information or towards the one obtained from the word in isolation. In contrast to existing studies that rely on large amounts of annotated data for robust performance in a multilingual environment, the proposed approach uses minimal annotated resources and no external resources, making it easily extensible to new languages. Evaluation on a corpus of transliterated Facebook comments shows that the proposed approach outperforms its baseline counterparts: classification based on contextual information, classification based on the word in isolation, and an ensemble of the two classifiers.
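
The switching idea described in the abstract can be illustrated with a small sketch. This is not the authors' architecture: it assumes, purely for illustration, that the switch is a scalar gate in (0, 1) that linearly blends two probability distributions over languages, one from a context-based classifier and one from a word-in-isolation classifier. The function names, the two-language label set, and the hand-picked scores and gate values are all hypothetical; in a trained model the gate would be computed from the input rather than passed in.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def switch_predict(context_scores, word_scores, gate_logit):
    """Blend two language distributions with a scalar gate (assumed form).

    A gate near 1 biases the output towards the contextual prediction;
    a gate near 0 biases it towards the word-in-isolation prediction.
    """
    p_context = softmax(context_scores)
    p_word = softmax(word_scores)
    g = sigmoid(gate_logit)
    return [g * pc + (1.0 - g) * pw for pc, pw in zip(p_context, p_word)]


LANGS = ["en", "hi"]  # hypothetical label set

# The context favours "hi" while the isolated word favours "en".
context_scores = [0.0, 2.0]
word_scores = [2.0, 0.0]

trust_context = switch_predict(context_scores, word_scores, gate_logit=3.0)
trust_word = switch_predict(context_scores, word_scores, gate_logit=-3.0)

print(LANGS[trust_context.index(max(trust_context))])
print(LANGS[trust_word.index(max(trust_word))])
```

With the same two sets of scores, a high gate value yields the language favoured by the context and a low gate value yields the language favoured by the word alone, which is the behaviour the abstract attributes to the switching mechanism.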

Type
Article
Copyright
© The Author(s), 2021. Published by Cambridge University Press
