New treebank or repurposed? On the feasibility of cross-lingual parsing of Romance languages with Universal Dependencies

Published online by Cambridge University Press:  06 October 2017

MARCOS GARCIA
Affiliation:
LyS Group, Departamento de Letras, Facultade de Filoloxía, Universidade da Coruña, Campus de A Coruña, 15071 A Coruña, Galicia, Spain e-mail: [email protected]
CARLOS GÓMEZ-RODRÍGUEZ
Affiliation:
LyS Group, Departamento de Computación, Facultade de Informática, Universidade da Coruña, Campus de A Coruña, 15071 A Coruña, Galicia, Spain e-mail: [email protected], [email protected]
MIGUEL A. ALONSO
Affiliation:
LyS Group, Departamento de Computación, Facultade de Informática, Universidade da Coruña, Campus de A Coruña, 15071 A Coruña, Galicia, Spain e-mail: [email protected], [email protected]

Abstract

This paper addresses the feasibility of cross-lingual parsing with Universal Dependencies (UD) between Romance languages, analyzing its performance compared with the use of manually annotated resources of the target languages. Several experiments take into account factors such as the lexical distance between the source and target varieties, the impact of delexicalization, the combination of different source treebanks, and the adaptation of resources to the target language. The results of these evaluations show that directly applying a parser from one Romance language to another reaches labeled attachment score (LAS) values similar to those obtained with a manual annotation of about 3,000 tokens in the target language, and unlabeled attachment score (UAS) results equivalent to the use of around 7,000 tokens, depending on the case. These numbers can increase noticeably with a focused selection of the source treebanks. Furthermore, removing the words from the training corpus (delexicalization) is not useful in most cases of cross-lingual parsing of Romance languages. The lessons learned from these experiments were used to build a new UD treebank for Galician, with 1,000 sentences manually corrected after an automatic cross-lingual annotation. Several evaluations on this new resource show that a cross-lingual parser built with the best combination and adaptation of the source treebanks performs better (77 percent LAS and 82 percent UAS) than one trained on more than 16,000 (for LAS) or more than 20,000 (for UAS) manually labeled tokens of Galician.
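The two metrics and the delexicalization step named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation or preprocessing code. The function names `attachment_scores` and `delexicalize` are hypothetical, and the sketch assumes that each token is represented as a (head index, dependency label) pair, that every token is scored (the paper's own protocol, for example its treatment of punctuation, may differ), and that delexicalization amounts to blanking the FORM and LEMMA columns of a ten-column CoNLL-U token line.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions in [0, 1] over aligned tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have the same length")
    if not gold:
        raise ValueError("cannot score an empty analysis")
    # UAS counts tokens whose predicted head is correct.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS additionally requires the dependency label to match.
    full_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / len(gold), full_hits / len(gold)


def delexicalize(conllu_line: str) -> str:
    """Blank the FORM and LEMMA columns of one CoNLL-U token line (assumed scheme)."""
    cols = conllu_line.split("\t")
    if len(cols) != 10:
        return conllu_line  # comments and blank lines pass through unchanged
    cols[1] = cols[2] = "_"
    return "\t".join(cols)


# Toy four-token sentence with invented annotations, for illustration only.
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "dobj")]
pred = [(2, "det"), (3, "dobj"), (0, "root"), (2, "dobj")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # prints "UAS=0.75 LAS=0.50"
```

In the toy sentence, three of the four heads are correct but only two tokens have both the correct head and the correct label, which is why LAS can never exceed UAS.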

Type
Articles
Copyright
Copyright © Cambridge University Press 2017 

Footnotes

This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MICINN) through a Juan de la Cierva formación grant (FJCI-2014-22853), by the projects with references FFI2014-51978-C2-1-R and FFI2014-51978-C2-2-R (MINECO), and by the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Programme (grant agreement no. 714150 – FASTPARSE).
