Two approaches to compilation of bilingual multi-word terminology lists from lexical resources

Published online by Cambridge University Press: 28 January 2020

Branislava Šandrih*
Affiliation:
Faculty of Philology, University of Belgrade, Belgrade, Serbia
Cvetana Krstev
Affiliation:
Faculty of Philology, University of Belgrade, Belgrade, Serbia
Ranka Stanković
Affiliation:
Faculty of Mining and Geology, University of Belgrade, Belgrade, Serbia
*Corresponding author. Email: [email protected]

Abstract

In this paper, we present two approaches and the implemented system for bilingual terminology extraction that rely on an aligned bilingual domain corpus, a terminology extractor for a target language, and a tool for chunk alignment. The two approaches differ in the way terminology for the source language is obtained: the first relies on an existing domain terminology lexicon, while the second uses a term extraction tool. For both approaches, four experiments were performed with two parameters being varied. In the experiments presented in this paper, the source language was English, the target language was Serbian, and the selected domain was Library and Information Science, for which both an aligned corpus and a bilingual terminological dictionary exist. For term extraction, we used the FlexiTerm tool for the source language and a shallow parser for the target language, while for word alignment we used GIZA++. The evaluation results show that for the first approach the F1 score varies from 29.43% to 51.15%, while for the second it varies from 61.03% to 71.03%. On the basis of the evaluation results, we developed a binary classifier that decides whether a candidate pair, composed of aligned source and target terms, is valid. We trained and evaluated different classifiers on a list of manually labeled candidate pairs produced by our extraction system. The best results in a fivefold cross-validation setting were achieved with the Radial Basis Function Support Vector Machine classifier, giving an F1 score of 82.09% and an accuracy of 78.49%.
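As a rough illustration of how the F1 scores reported above can be read, the sketch below computes precision, recall, and F1 for a list of extracted candidate term pairs against a reference list. It is not the authors' evaluation code. The function names `normalize` and `evaluate_pairs`, the matching rule (exact match after lowercasing and whitespace normalization), and the toy term pairs are all assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (source term, target term)


def normalize(pair: Pair) -> Pair:
    """Lowercase both terms and collapse whitespace.

    This matching rule is an assumption for the sketch, not the
    protocol described in the paper.
    """
    src, tgt = pair
    return (" ".join(src.lower().split()), " ".join(tgt.lower().split()))


def evaluate_pairs(candidates: Iterable[Pair], gold: Iterable[Pair]) -> Dict[str, float]:
    """Return precision, recall, and F1 of candidate pairs against gold pairs."""
    cand = {normalize(p) for p in candidates}
    ref = {normalize(p) for p in gold}
    true_positives = len(cand & ref)
    precision = true_positives / len(cand) if cand else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy, made-up data (not taken from the paper's dictionary or corpus).
gold = [
    ("information retrieval", "pretraživanje informacija"),
    ("digital library", "digitalna biblioteka"),
    ("subject heading", "predmetna odrednica"),
]
candidates = [
    ("Digital  library", "digitalna biblioteka"),        # correct after normalization
    ("information retrieval", "pretraživanje informacija"),  # correct
    ("library catalogue", "digitalna biblioteka"),       # wrong pairing
    ("information science", "predmetna odrednica"),      # wrong pairing
]

scores = evaluate_pairs(candidates, gold)
print(scores)
```

In this toy setup, two of the four candidates match the reference list, so precision is 2/4 and recall is 2/3. F1 is the harmonic mean of the two, so it is pulled toward the lower value.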

Type
Article
Copyright
© Cambridge University Press 2020

Footnotes

This research was supported by the Serbian Ministry of Education and Science under grants #III 47003 and 178006.
