
Social media text normalization for Turkish

Published online by Cambridge University Press:  02 June 2017

GÜLŞEN ERYİĞİT
Affiliation:
Department of Computer Engineering, Istanbul Technical University, Istanbul, Turkey
DİLARA TORUNOĞLU-SELAMET
Affiliation:
Department of Computer Engineering, Istanbul Technical University, Istanbul, Turkey

Abstract

Text normalization is an indispensable stage in processing noncanonical language from natural sources, such as speech, social media or short text messages. Research in this field is very recent and mostly on English. As is known from different areas of natural language processing, morphologically rich languages (MRLs) pose many different challenges when compared to English. Turkish is a strong representative of MRLs and has particular normalization problems that may not be easily solved by a single-stage pure statistical model. This article introduces the first work on the social media text normalization of an MRL and presents the first complete social media text normalization system for Turkish. The article conducts an in-depth analysis of the error types encountered in Web 2.0 Turkish texts, categorizes them into seven groups and provides solutions for each of them by dividing the candidate generation task into separate modules working in a cascaded architecture. For the first time in the literature, two manually normalized Web 2.0 datasets are introduced for Turkish normalization studies. The exact match scores of the overall system on the provided datasets are 70.40 per cent and 67.37 per cent (77.07 per cent with a case insensitive evaluation).
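The cascaded candidate generation and the exact match evaluation mentioned in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen only for illustration: the two toy modules, the tiny lexicon and the helper names are assumptions, not the seven error categories or the modules of the actual system. The sketch assumes that each module either proposes a candidate or declines, and that the first candidate accepted by a validity check is kept.

```python
import re
from typing import Callable, Optional, Sequence

# Hypothetical module type: returns a candidate string, or None if it does not apply.
Module = Callable[[str], Optional[str]]

# Toy resources, assumed for illustration only.
LEXICON = {"çok", "güzel"}
ABBREVIATIONS = {"gzl": "güzel"}


def is_valid(token: str) -> bool:
    """Stand-in for a real validity check such as a morphological analyzer."""
    return token in LEXICON


def collapse_repeats(token: str) -> Optional[str]:
    """Toy module: squeeze runs of a repeated character into one character."""
    collapsed = re.sub(r"(.)\1+", r"\1", token)
    return collapsed if collapsed != token else None


def expand_abbreviation(token: str) -> Optional[str]:
    """Toy module: look the token up in a small abbreviation table."""
    return ABBREVIATIONS.get(token)


def normalize_token(token: str, modules: Sequence[Module],
                    validator: Callable[[str], bool]) -> str:
    """Run the modules in order and keep the first valid candidate."""
    if validator(token):
        return token  # already well formed, nothing to do
    for module in modules:
        candidate = module(token)
        if candidate is not None and validator(candidate):
            return candidate
    return token  # no module produced a valid candidate; leave unchanged


def exact_match(predicted: Sequence[str], gold: Sequence[str],
                case_sensitive: bool = True) -> float:
    """Fraction of positions where the prediction equals the gold token."""
    pairs = list(zip(predicted, gold))
    if not pairs:
        return 0.0
    if not case_sensitive:
        # Note: str.lower() is not Turkish-aware (I/ı, İ/i); a real
        # evaluation would need locale-specific case folding.
        pairs = [(p.lower(), g.lower()) for p, g in pairs]
    return sum(p == g for p, g in pairs) / len(pairs)


MODULES = [collapse_repeats, expand_abbreviation]
```

With these toy definitions, `normalize_token("çoook", MODULES, is_valid)` passes the token through the cascade until a module yields a form found in the lexicon, and `exact_match(..., case_sensitive=False)` corresponds to the case-insensitive variant of the reported score.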

Type
Articles
Copyright
Copyright © Cambridge University Press 2017 

