Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data

D. LANGLOIS; M. SAAD; K. SMAILI

doi:10.1017/S1351324918000232

Published online by Cambridge University Press: 19 June 2018

D. LANGLOIS: SMarT Group, LORIA, INRIA, Villers-lès-Nancy F-54600, France; Université de Lorraine, LORIA, UMR 7503, Villers-lès-Nancy F-54600, France; CNRS, LORIA, UMR 7503, Villers-lès-Nancy F-54600, France
M. SAAD: Islamic University of Gaza, Department of Computer Sciences
K. SMAILI: SMarT Group, LORIA, INRIA, Villers-lès-Nancy F-54600, France; Université de Lorraine, LORIA, UMR 7503, Villers-lès-Nancy F-54600, France; CNRS, LORIA, UMR 7503, Villers-lès-Nancy F-54600, France

Abstract

This article addresses the comparability of documents that are extracted from different sources and written in different languages. These documents are not necessarily translations of each other. Such material is referred to as multilingual comparable corpora. These language resources are useful for multilingual natural language processing applications, especially for low-resourced language pairs. In this paper, we collect data in Arabic, English, and French. Two corpora are built by using the hyperlinks available in Wikipedia and Euronews. The Euronews corpus is an aligned multilingual (Arabic, English, and French) corpus of 34k documents collected from the Euronews website. A more challenging issue is to build a comparable corpus from two different and independent media with distinct editorial lines, such as the British Broadcasting Corporation (BBC) and Al Jazeera (JSC). To build such a corpus, we propose to use the Cross-Lingual Latent Semantic Indexing (LSI) approach. For this purpose, documents were harvested from the BBC and JSC websites for each month of the years 2012 and 2013. Comparability is calculated for each Arabic–English pair of documents within each month. This automatic alignment is then validated by hand. This led to a multilingual (Arabic–English) aligned corpus of 305 pairs of documents (233k English words and 137k Arabic words). In addition, this paper presents a study of the performance of three methods from the literature that measure the comparability of documents on the multilingual reference corpora. A recall at rank 1 of 50.16 per cent is achieved with the Cross-Lingual LSI approach on the BBC–JSC test corpus, while the dictionary-based method reaches a recall of only 35.41 per cent.
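To make the reported metric concrete, the following is a minimal sketch of how recall at rank k could be computed once every document is represented as a vector in a shared space. It is not the authors' implementation: the function names `cosine` and `recall_at_k`, the use of cosine similarity, and the toy vectors are all assumptions made for illustration, and the sketch assumes that source document i is the reference match of target document i.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recall_at_k(
    source_vecs: Sequence[Sequence[float]],
    target_vecs: Sequence[Sequence[float]],
    k: int = 1,
) -> float:
    """Fraction of source documents whose reference target is among the k most similar targets.

    Assumption: source_vecs[i] and target_vecs[i] form a reference (aligned) pair.
    """
    if not source_vecs:
        return 0.0
    hits = 0
    for i, src in enumerate(source_vecs):
        scores = [cosine(src, tgt) for tgt in target_vecs]
        # Rank target indices from most to least similar.
        ranked = sorted(range(len(target_vecs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(source_vecs)


# Hypothetical toy vectors standing in for documents projected into a shared space.
english = [[1.0, 0.0, 0.2], [0.1, 1.0, 0.0], [0.1, 0.8, 0.6]]
arabic = [[0.9, 0.1, 0.1], [0.2, 0.9, 0.3], [0.0, 0.3, 0.9]]

print(recall_at_k(english, arabic, k=1))
print(recall_at_k(english, arabic, k=2))
```

In this toy data the third English vector is closer to the second Arabic vector than to its own reference, so it counts as a miss at rank 1 but as a hit at rank 2. How the vectors are obtained (for example, a latent semantic projection or a dictionary-based mapping) is outside the scope of this sketch.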

Type: Article
Information: Natural Language Engineering, Volume 24, Issue 5, September 2018, pp. 677–694

DOI: https://doi.org/10.1017/S1351324918000232
Copyright: © Cambridge University Press 2018
