
A SMS normalization system integrating multiple grammatical resources

Published online by Cambridge University Press:  07 June 2012

J. OLIVA
Affiliation:
Bioengineering Group, Spanish National Research Council (CSIC), Ctra. de Campo Real, km 0,200, La Poveda, Arganda del Rey, CP 28500, Madrid, Spain. E-mails: [email protected], [email protected], [email protected], [email protected]
J. I. SERRANO
Affiliation:
Bioengineering Group, Spanish National Research Council (CSIC), Ctra. de Campo Real, km 0,200, La Poveda, Arganda del Rey, CP 28500, Madrid, Spain. E-mails: [email protected], [email protected], [email protected], [email protected]
M. D. DEL CASTILLO
Affiliation:
Bioengineering Group, Spanish National Research Council (CSIC), Ctra. de Campo Real, km 0,200, La Poveda, Arganda del Rey, CP 28500, Madrid, Spain. E-mails: [email protected], [email protected], [email protected], [email protected]
Á. IGLESIAS
Affiliation:
Bioengineering Group, Spanish National Research Council (CSIC), Ctra. de Campo Real, km 0,200, La Poveda, Arganda del Rey, CP 28500, Madrid, Spain. E-mails: [email protected], [email protected], [email protected], [email protected]

Abstract

SMS language exhibits distinctive phenomena and deviates markedly from natural language. Every day, a very large number of chat messages, SMS messages, and e-mails are sent all over the world. This widespread use makes it important to develop systems that normalize SMS language into natural language. However, typical machine translation approaches are difficult to adapt to SMS language because of its many irregularities. This paper presents a new approach to SMS normalization that combines lexical and phonological translation techniques with disambiguation algorithms at two levels: lexical and semantic. The proposed method does not depend on a large annotated corpus, which is difficult to build, and it is applied in two different domains, showing that it adapts easily across languages and domains. The results obtained by the system outperform some existing SMS normalization methods, even though the Spanish language and the corpus created have features that complicate the normalization task.
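As a rough illustration of the pipeline described in the abstract, the sketch below looks up a token in an abbreviation dictionary (lexical translation), falls back to matching a crude phonetic key against a word list (phonological translation), and then picks one candidate. It is a toy under stated assumptions, not the authors' system: the dictionaries, the `phonetic_key` rules, and the function names are all hypothetical, and simple frequency ranking stands in for the lexical and semantic disambiguation steps.

```python
import re

# Hypothetical toy resources (assumptions for illustration only).
SMS_DICT = {"q": "que", "xq": "porque", "tb": "también"}
LEXICON_FREQ = {"casa": 50, "caza": 5, "beso": 30, "vaso": 12,
                "te": 90, "quiero": 40}


def phonetic_key(word):
    """Crude consonant-skeleton key; a stand-in for a real phonetic algorithm."""
    w = word.lower()
    w = w.replace("qu", "k").replace("c", "k").replace("z", "s")
    w = w.replace("v", "b").replace("h", "")
    w = re.sub(r"[aeiouáéíóú]", "", w)    # drop vowels
    return re.sub(r"(.)\1+", r"\1", w)    # collapse repeated letters


def candidates(token):
    """Return possible standard forms for one SMS token."""
    t = token.lower()
    if t in SMS_DICT:                     # lexical translation
        return [SMS_DICT[t]]
    if t in LEXICON_FREQ:                 # already a standard word
        return [t]
    key = phonetic_key(t)                 # phonological translation
    return [w for w in LEXICON_FREQ if phonetic_key(w) == key]


def normalize(message):
    """Normalize a whitespace-tokenized message, token by token."""
    out = []
    for token in message.split():
        cands = candidates(token)
        if not cands:
            out.append(token)             # unknown token: leave unchanged
        else:
            # Frequency ranking stands in for the disambiguation step.
            out.append(max(cands, key=lambda w: LEXICON_FREQ.get(w, 0)))
    return " ".join(out)


print(normalize("tb te kiero"))  # también te quiero
print(normalize("ksa"))          # casa
```

In this toy, "ksa" matches both "casa" and "caza" phonetically, which is the kind of ambiguity a disambiguation stage has to resolve; a context-aware method would use the surrounding words rather than raw frequency.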

Type
Articles
Copyright
Copyright © Cambridge University Press 2012
