Efficiently generating correction suggestions for garbled tokens of historical language

ULRICH REFFLE

doi:10.1017/S1351324911000039

Published online by Cambridge University Press: 21 March 2011

ULRICH REFFLE: Affiliation:
Centrum für Informations- und Sprachverarbeitung, University of Munich, Germany email: [email protected]

Abstract

Text correction systems rely on a core mechanism where suitable correction suggestions for garbled input tokens are generated. Current systems, which are designed for documents including modern language, use some form of approximate search in a given background lexicon. Due to the large amount of spelling variation found in historical documents, special lexica for historical language can only offer restricted coverage. Hence historical language is often described in terms of a matching procedure to be applied to modern words. Given such a procedure and a base lexicon of modern words, the question arises of how to generate correction suggestions for garbled historical variants. In this paper we suggest an efficient algorithm that solves this problem. The algorithm is used for postcorrection of optical character recognition results on historical document collections.
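
As an illustration of the problem stated in the abstract, the following is a minimal brute-force sketch, not the efficient algorithm the paper proposes. It assumes the matching procedure can be written as a list of substring rewrite patterns applied to modern words, and it uses plain Levenshtein distance as the approximate-search criterion. The function names, the two patterns, the toy lexicon, and the garbled token are all hypothetical and chosen only for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def historical_variants(word, patterns):
    """Every string obtained by rewriting any set of non-overlapping
    pattern occurrences in `word` (the unchanged word is included).
    Assumes every left-hand side is non-empty."""
    def expand(i):
        if i == len(word):
            yield ""
            return
        for tail in expand(i + 1):              # keep the character at i
            yield word[i] + tail
        for lhs, rhs in patterns:               # or rewrite lhs -> rhs at i
            if word.startswith(lhs, i):
                for tail in expand(i + len(lhs)):
                    yield rhs + tail
    return set(expand(0))


def suggest(token, lexicon, patterns, max_dist=1):
    """Naive baseline: enumerate all hypothetical historical variants of all
    modern words and keep those within `max_dist` edits of the garbled token."""
    hits = []
    for modern in lexicon:
        for variant in historical_variants(modern, patterns):
            d = levenshtein(token, variant)
            if d <= max_dist:
                hits.append((d, variant, modern))
    return sorted(hits)


# Hypothetical patterns and lexicon, for illustration only.
PATTERNS = [("t", "th"), ("ei", "ey")]
LEXICON = ["teil", "wein"]

for dist, variant, modern in suggest("tbeyl", LEXICON, PATTERNS):
    print(dist, variant, modern)
```

Enumerating every variant of every lexicon word grows quickly with the number of patterns and the size of the lexicon, which is the cost an efficient algorithm for this task has to avoid.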

Type: Papers
Information: Natural Language Engineering, Volume 17, Special Issue 2: Finite-State Methods and Models in Natural Language Processing, April 2011, pp. 265–282

DOI: https://doi.org/10.1017/S1351324911000039
Copyright: Copyright © Cambridge University Press 2011
