Estimating word-level quality of statistical machine translation output using monolingual information alone

Published online by Cambridge University Press: 27 March 2019

Arda Tezcan*
Affiliation:
LT3, Language and Translation Technology Team, Department of Translation, Interpreting and Communication, Ghent University, Ghent, Belgium
Véronique Hoste
Affiliation:
LT3, Language and Translation Technology Team, Department of Translation, Interpreting and Communication, Ghent University, Ghent, Belgium
Lieve Macken
Affiliation:
LT3, Language and Translation Technology Team, Department of Translation, Interpreting and Communication, Ghent University, Ghent, Belgium
*Corresponding author. Email: [email protected]

Abstract

Various studies show that statistical machine translation (SMT) systems suffer from fluency errors, especially grammatical errors and errors related to idiomatic word choices. In this study, we investigate how effectively the monolingual information contained in the machine-translated text can be used to estimate the word-level quality of SMT output. We propose a recurrent neural network architecture that uses morpho-syntactic features and word embeddings as word representations within surface and syntactic n-grams. We test the proposed method on two language pairs and on two tasks, namely detecting fluency errors and predicting overall post-editing effort. Our results show that this method is effective for capturing all types of fluency errors at once. Moreover, on the task of predicting post-editing effort, while relying solely on monolingual information, it achieves results on par with state-of-the-art quality estimation systems that use both bilingual and monolingual information.
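
To make the distinction between surface and syntactic n-grams concrete, the following is a minimal illustrative sketch in Python. It is not the authors' implementation: the function names, the padding symbol, the centred-window definition of a surface n-gram, and the "token, head, head of head" definition of a syntactic n-gram are all assumptions made for illustration, since the abstract does not specify them. The dependency heads are supplied by hand here; in practice they would come from a parser.

```python
from typing import List

PAD = "<pad>"  # hypothetical padding symbol, chosen for this sketch


def surface_ngrams(tokens: List[str], n: int = 3) -> List[List[str]]:
    """For each token, return the window of n tokens centred on it (n odd)."""
    half = n // 2
    padded = [PAD] * half + tokens + [PAD] * half
    return [padded[i:i + n] for i in range(len(tokens))]


def syntactic_ngrams(tokens: List[str], heads: List[int], n: int = 3) -> List[List[str]]:
    """For each token, follow dependency heads upward and collect up to n tokens.

    heads[i] is the index of the head of token i, or -1 for the root
    (an assumed encoding). Paths shorter than n are padded.
    """
    result = []
    for i in range(len(tokens)):
        path: List[str] = []
        j = i
        while j != -1 and len(path) < n:
            path.append(tokens[j])
            j = heads[j]
        path += [PAD] * (n - len(path))
        result.append(path)
    return result


tokens = ["the", "cat", "sleeps"]
heads = [1, 2, -1]  # "the" -> "cat" -> "sleeps" (root); hand-written toy parse

surface = surface_ngrams(tokens)
syntactic = syntactic_ngrams(tokens, heads)
```

In a setup like the one the abstract describes, each token in such an n-gram would then be mapped to a word embedding and morpho-syntactic features before being passed to the recurrent network; that mapping is omitted here.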

Type
Article
Copyright
© Cambridge University Press 2019 
