
An information-theoretic, vector-space-model approach to cross-language information retrieval

Published online by Cambridge University Press: 05 January 2011

PETER A. CHEW
Affiliation:
Moss Adams LLP, Albuquerque, NM 87110-4189, USA
BRETT W. BADER
Affiliation:
Sandia National Laboratories, Albuquerque, NM 87185-0519, USA
STEPHEN HELMREICH
Affiliation:
New Mexico State University, New Mexico, 88003-8001, USA
AHMED ABDELALI
Affiliation:
New Mexico State University, New Mexico, 88003-8001, USA
STEPHEN J. VERZI
Affiliation:
Sandia National Laboratories, Albuquerque, NM 87185-0519, USA

Abstract

In this article, we demonstrate several novel ways in which insights from information theory (IT) and computational linguistics (CL) can be woven into a vector-space-model (VSM) approach to information retrieval (IR). Our proposals focus on three areas: pre-processing (morphological analysis), term weighting, and geometrical models that offer alternatives to the widely used term-by-document matrix. The last of these include (1) PARAFAC2 decomposition of a term-by-document-by-language tensor, and (2) eigenvalue decomposition of a term-by-term matrix (inspired by Statistical Machine Translation). We evaluate all proposals on a multilingual document clustering task, comparing them to a ‘standard’ approach based on Latent Semantic Analysis. The evidence suggests that proper consideration of IT within IR is indeed called for: in all cases, our best results are achieved using the information-theoretic variations on the standard approach. Furthermore, we show that different information-theoretic options can be combined for still better results. A key function of language is to encode and convey information, and contributions of IT to the field of CL can be traced back a number of decades. We think that our proposals help bring IR and CL more into line with one another. In our conclusion, we suggest that the empirical improvements our proposals yield are no coincidence: the proposals increase the theoretical transparency of VSM approaches to IR, and they help shed light on why aspects of these approaches work as they do.
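For orientation, the kind of ‘standard’ baseline the abstract refers to can be sketched as follows. The abstract gives no formulas, so this is only an illustration under stated assumptions: log-entropy term weighting is assumed as a conventional choice for Latent Semantic Analysis, the toy counts are invented, and every function name is hypothetical. The sketch does not reproduce the information-theoretic variations, PARAFAC2, or the term-by-term decomposition proposed in the article.

```python
import numpy as np


def log_entropy_weight(counts):
    """Weight a term-by-document count matrix (rows = terms, columns = documents)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    row_sums = counts.sum(axis=1, keepdims=True)
    # p[i, j]: share of term i's occurrences that fall in document j
    p = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    # p * log(p), taking 0 * log(0) as 0
    plogp = p * np.log(np.where(p > 0, p, 1.0))
    # Global weight: 1 for a term confined to one document,
    # 0 for a term spread evenly over all documents
    global_w = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    return np.log1p(counts) * global_w[:, None]


def term_basis(weighted, k):
    """Return the k leading left singular vectors (one row per term)."""
    u, _, _ = np.linalg.svd(weighted, full_matrices=False)
    return u[:, :k]


def cosine_similarity(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms > 0, norms, 1.0)
    return unit @ unit.T


# Assumed toy training data: each column is a parallel English+French
# document, so aligned terms co-occur in the same columns.
terms = ["dog", "cat", "chien", "chat", "the", "le"]
counts = np.array([
    [2, 0, 1],  # dog
    [0, 2, 1],  # cat
    [2, 0, 1],  # chien
    [0, 2, 1],  # chat
    [1, 1, 1],  # the
    [1, 1, 1],  # le
])

basis = term_basis(log_entropy_weight(counts), k=2)

# Monolingual test documents as raw term counts, projected into the
# reduced space by multiplying with the term basis.
queries = np.array([
    [1, 0, 0, 0, 1, 0],  # English: "the dog"
    [0, 0, 1, 0, 0, 1],  # French: "le chien"
    [0, 0, 0, 1, 0, 1],  # French: "le chat"
])
sim = cosine_similarity(queries @ basis)
print(np.round(sim, 3))
```

On this toy data the English query about dogs lands next to the French query about dogs and away from the French query about cats, because aligned terms have identical rows in the training matrix and the evenly distributed function words receive zero weight. Real corpora are far noisier, which is where the choice of weighting and decomposition starts to matter.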

Type
Papers
Copyright
Copyright © Cambridge University Press 2011

