
Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data

Published online by Cambridge University Press:  19 June 2018

D. LANGLOIS
Affiliation:
SMarT Group, LORIA, INRIA, Villers-lès-Nancy F-54600, France e-mail: [email protected], [email protected] Université de Lorraine, LORIA, UMR 7503, Villers-lès-Nancy F-54600, France CNRS, LORIA, UMR 7503, Villers-lès-Nancy F-54600, France
M. SAAD
Affiliation:
Islamic University of Gaza, Department of Computer Sciences e-mail: [email protected]
K. SMAILI
Affiliation:
SMarT Group, LORIA, INRIA, Villers-lès-Nancy F-54600, France e-mail: [email protected], [email protected] Université de Lorraine, LORIA, UMR 7503, Villers-lès-Nancy F-54600, France CNRS, LORIA, UMR 7503, Villers-lès-Nancy F-54600, France

Abstract

This article addresses the comparability of documents that are extracted from different sources and written in different languages. These documents are not necessarily translations of each other. Such material is referred to as multilingual comparable corpora. These language resources are useful for multilingual natural language processing applications, especially for low-resourced language pairs. In this paper, we collect data in Arabic, English, and French. Two corpora are built by using the hyperlinks available in Wikipedia and Euronews. The Euronews corpus is an aligned multilingual (Arabic, English, and French) corpus of 34k documents collected from the Euronews website. A more challenging issue is to build a comparable corpus from two different and independent media with distinct editorial lines, such as the British Broadcasting Corporation (BBC) and Al Jazeera (JSC). To build such a corpus, we propose to use the Cross-Lingual Latent Semantic Indexing (LSI) approach. For this purpose, documents were harvested from the BBC and JSC websites for each month of 2012 and 2013. Comparability is calculated for each Arabic–English pair of documents within each month. The output of this automatic task is then validated by hand. This leads to a multilingual (Arabic–English) aligned corpus of 305 pairs of documents (233k English words and 137k Arabic words). In addition, we present a study of the performance of three methods from the literature for measuring the comparability of documents on the multilingual reference corpora. The Cross-Lingual LSI approach achieves a recall at rank 1 of 50.16 per cent on the BBC–JSC test corpus, while the dictionary-based method reaches a recall of only 35.41 per cent.
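To make the reported metric concrete, the following is a minimal sketch of how recall at rank k could be computed once every document has been mapped to a vector. It is an illustration under stated assumptions, not the evaluation code used in the article: the vectors are toy values, cosine similarity is assumed as the comparability score, and the true counterpart of source document i is assumed to be target document i. The function names `cosine` and `recall_at_rank` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recall_at_rank(source_vecs, target_vecs, k=1):
    """Fraction of source documents whose true counterpart is among the top-k targets.

    Assumption (not from the article): source_vecs[i] and target_vecs[i]
    are the vectors of an aligned document pair.
    """
    if not source_vecs:
        raise ValueError("source_vecs must not be empty")
    hits = 0
    for i, src in enumerate(source_vecs):
        scores = [cosine(src, tgt) for tgt in target_vecs]
        # Rank target indices from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(source_vecs)


# Toy vectors standing in for documents projected into a shared space.
source_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_docs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]

print(recall_at_rank(source_docs, target_docs, k=1))
print(recall_at_rank(source_docs, target_docs, k=2))
```

With this definition, a recall at rank 1 of 50.16 per cent would mean that, for about half of the source documents, the highest-scoring candidate in the other language is the correct counterpart.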

Type
Article
Copyright
Copyright © Cambridge University Press 2018 
