Rewriting the orthography of SMS messages

Published online by Cambridge University Press:  24 March 2010

FRANÇOIS YVON*
Affiliation:
LIMSI-CNRS and Université Paris Sud 11, Paris, France

Abstract

Electronic written texts used in computer-mediated interactions (emails, blogs, chats, and the like) contain significant deviations from the norm of the language. This paper presents the details of a system that aims to normalize the orthography of French SMS messages: after discussing the linguistic peculiarities of these messages and possible approaches to their automatic normalization, we present, compare, and evaluate various instantiations of a normalization device based on weighted finite-state transducers. These experiments show that, using an intermediate phonemic representation and training, our system outperforms an alternative normalization system based on phrase-based statistical machine translation techniques.

Type
Papers
Copyright
Copyright © Cambridge University Press 2010
