Rewriting the orthography of SMS messages

FRANÇOIS YVON

doi:10.1017/S1351324909990258

Published online by Cambridge University Press: 24 March 2010

Affiliation: LIMSI-CNRS and Université Paris Sud 11, Paris, France. E-mail: [email protected]

Abstract

Electronic written texts used in computer-mediated interactions (emails, blogs, chats, and the like) contain significant deviations from the norm of the language. This paper presents the details of a system that aims to normalize the orthography of French SMS messages. After discussing the linguistic peculiarities of these messages and possible approaches to their automatic normalization, we present, compare, and evaluate various instantiations of a normalization device based on weighted finite-state transducers. These experiments show that, using an intermediate phonemic representation and training, our system outperforms an alternative normalization system based on phrase-based statistical machine translation techniques.

Type: Papers
Information: Natural Language Engineering, Volume 16, Issue 2, April 2010, pp. 133–159

DOI: https://doi.org/10.1017/S1351324909990258
Copyright: © Cambridge University Press 2010
