Exploiting unbalanced specialized comparable corpora for bilingual lexicon extraction†
Published online by Cambridge University Press: 15 June 2016
Abstract
Most work on bilingual lexicon extraction from comparable corpora rests on the implicit hypothesis that corpora are balanced in terms of size. However, the historical context-based projection method is relatively insensitive to the size of each part of the comparable corpus. Within this context, we have carried out a series of experiments to study the influence of unbalanced specialized comparable corpora on the quality of bilingual terminology extraction. Moreover, we have introduced a strategy into the context-based projection method to re-estimate word co-occurrence observations. This is done by using smoothing or prediction techniques that boost the observations of word co-occurrences, a boost that is mainly useful for the smallest part of an unbalanced comparable corpus. Our results show that the use of unbalanced specialized comparable corpora leads to a significant improvement in the quality of extracted lexicons.
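As a rough illustration of what re-estimating word co-occurrence observations can look like, the sketch below counts co-occurrences in a fixed window and then applies add-one (Laplace) smoothing. The abstract does not say which smoothing or prediction technique is used, so add-one smoothing, the window size, the toy tokens, and the function names `cooccurrence_counts` and `add_one_smooth` are all assumptions made for this example.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count ordered (word, context word) pairs seen within `window` tokens."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


def add_one_smooth(counts, vocabulary):
    """Add-one smoothing (an assumed choice): every pair of vocabulary
    words, seen or unseen, gets its observed count plus one."""
    return {
        (w, c): counts.get((w, c), 0) + 1
        for w in vocabulary
        for c in vocabulary
    }


# Toy data, purely illustrative.
tokens = ["cancer", "breast", "tumor"]
raw = cooccurrence_counts(tokens, window=1)
smoothed = add_one_smooth(raw, sorted(set(tokens)))
print(raw[("cancer", "breast")], smoothed[("cancer", "tumor")])
```

In this toy case the pair ("cancer", "tumor") is never observed, yet it receives a non-zero smoothed count. This is the kind of boost that matters most when one side of the corpus is small and many co-occurrences are unseen.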
- Type: Articles
- Information: Natural Language Engineering, Volume 22, Issue 4: Machine Translation Using Comparable Corpora, July 2016, pp. 575-601
- Copyright: © Cambridge University Press 2016
Footnotes
We thank the two anonymous reviewers whose comments and suggestions helped improve and clarify this manuscript. This work is supported by the French National Research Agency under grant ANR-12-CORD-0020.