Exploiting unbalanced specialized comparable corpora for bilingual lexicon extraction†

EMMANUEL MORIN; AMIR HAZEM

doi:10.1017/S1351324916000140


Published online by Cambridge University Press: 15 June 2016


Affiliation (both authors): Université de Nantes, LINA UMR CNRS 6241, 2 rue de la Houssinière, BP 92208, 44322 Nantes Cedex 03, France


Abstract

Most work on bilingual lexicon extraction from comparable corpora rests on the implicit hypothesis that the corpora are balanced in size. However, the historical context-based projection method is relatively insensitive to the size of each part of the comparable corpus. Within this context, we have studied the influence of unbalanced specialized comparable corpora on the quality of bilingual terminology extraction through a series of experiments. Moreover, we have introduced into the context-based projection method a strategy for re-estimating word co-occurrence observations. It relies on smoothing or prediction techniques that boost the observed word co-occurrences, which is mainly useful for the smallest part of an unbalanced comparable corpus. Our results show that using unbalanced specialized comparable corpora significantly improves the quality of the extracted lexicons.
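To make the idea of re-estimating co-occurrence observations concrete, the following is a minimal sketch in Python. It uses additive (Lidstone) smoothing purely as an illustration; the abstract does not say which smoothing or prediction techniques the authors evaluate, and the function names, window size, `delta` value, and toy corpus are all assumptions made for this example.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair co-occurs within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def lidstone_smooth(counts, vocabulary, delta=0.5):
    """Re-estimate counts by adding `delta` to every pair, seen or unseen.

    Illustrative assumption: this is one simple smoothing scheme, not
    necessarily one used in the article.
    """
    return {
        (w, c): counts.get((w, c), 0) + delta
        for w in vocabulary
        for c in vocabulary
        if w != c
    }


# Hypothetical toy data standing in for the smaller side of an
# unbalanced comparable corpus.
small_corpus = [
    ["breast", "cancer", "screening"],
    ["cancer", "treatment"],
]
vocab = sorted({t for s in small_corpus for t in s})
raw = cooccurrence_counts(small_corpus)
smoothed = lidstone_smooth(raw, vocab)
```

In this sketch, a pair never observed in the small corpus, such as ("breast", "treatment"), receives a small non-zero count after smoothing, so sparse observations no longer yield empty entries in a context vector.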

Type: Articles
Information: Natural Language Engineering, Volume 22, Issue 4: Machine Translation Using Comparable Corpora, July 2016, pp. 575–601

DOI: https://doi.org/10.1017/S1351324916000140
Copyright: © Cambridge University Press 2016


Footnotes

† We thank the two anonymous reviewers whose comments and suggestions helped improve and clarify this manuscript. This work is supported by the French National Research Agency under grant ANR-12-CORD-0020.

