Estimating word-level quality of statistical machine translation output using monolingual information alone

Arda Tezcan; Véronique Hoste; Lieve Macken

doi:10.1017/S1351324919000111


Published online by Cambridge University Press: 27 March 2019


Affiliation (all authors): LT3, Language and Translation Technology Team, Department of Translation, Interpreting and Communication, Ghent University, Ghent, Belgium
Corresponding author: Arda Tezcan


Abstract

Various studies show that statistical machine translation (SMT) systems suffer from fluency errors, especially in the form of grammatical errors and errors related to idiomatic word choices. In this study, we investigate the effectiveness of using monolingual information contained in the machine-translated text to estimate the word-level quality of SMT output. We propose a recurrent neural network architecture that uses morpho-syntactic features and word embeddings as word representations within surface and syntactic n-grams. We test the proposed method on two language pairs and on two tasks, namely detecting fluency errors and predicting overall post-editing effort. Our results show that this method is effective for capturing all types of fluency errors at once. Moreover, on the task of predicting post-editing effort, while relying solely on monolingual information, it achieves results on par with state-of-the-art quality estimation systems that use both bilingual and monolingual information.
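The abstract describes each target word through surface n-grams (its neighbours in the sentence) and syntactic n-grams (words linked to it in the dependency tree), with every word represented by a word embedding plus morpho-syntactic features, and a recurrent network reading those n-grams. The Python sketch below illustrates that input pipeline only under assumptions that are ours, not the paper's. We assume a syntactic n-gram is the chain from a word up through its dependency heads, and we reduce morpho-syntactic features to a one-hot part-of-speech tag. Embeddings and weights are random toy values. A plain tanh RNN stands in for whatever recurrent unit the authors actually used. All function names (`surface_ngrams`, `syntactic_ngrams`, `bad_probabilities`, etc.) are hypothetical.

```python
import math
import random

# Toy setup (hypothetical, not from the paper): one-hot POS tags stand in for
# the morpho-syntactic features, random vectors stand in for word embeddings.
POS_TAGS = ["DET", "NOUN", "VERB", "ADJ", "ADP"]
EMB_DIM = 4
HIDDEN = 5


def surface_ngrams(length, n):
    """Indices of the n-token window centred on each word (n assumed odd);
    None marks positions that fall outside the sentence."""
    half = n // 2
    return [[j if 0 <= j < length else None for j in range(i - half, i + half + 1)]
            for i in range(length)]


def syntactic_ngrams(heads, n):
    """Indices of the chain word -> head -> head's head ... (at most n items).
    heads[i] is the 1-based head of word i, 0 for the root (CoNLL style)."""
    grams = []
    for i in range(len(heads)):
        chain = [i]
        while len(chain) < n and heads[chain[-1]] != 0:
            chain.append(heads[chain[-1]] - 1)
        grams.append(chain + [None] * (n - len(chain)))  # pad to length n
    return grams


def word_vector(idx, tokens, tags, embeddings):
    """Embedding concatenated with a one-hot POS vector; zeros for padding."""
    if idx is None:
        return [0.0] * (EMB_DIM + len(POS_TAGS))
    emb = embeddings.get(tokens[idx], [0.0] * EMB_DIM)
    return emb + [1.0 if tags[idx] == t else 0.0 for t in POS_TAGS]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def rnn_encode(vectors, w_in, w_rec):
    """Plain tanh (Elman) RNN over one n-gram; returns the last hidden state."""
    h = [0.0] * len(w_rec)
    for x in vectors:
        h = [math.tanh(sum(w * xi for w, xi in zip(w_in[k], x))
                       + sum(w * hj for w, hj in zip(w_rec[k], h)))
             for k in range(len(w_rec))]
    return h


def bad_probabilities(tokens, tags, heads, n=3, seed=0):
    """Untrained forward pass: one probability per word that it is erroneous."""
    rng = random.Random(seed)
    # sorted() keeps embedding creation order deterministic across runs
    embeddings = {t: [rng.uniform(-1, 1) for _ in range(EMB_DIM)]
                  for t in sorted(set(tokens))}
    w_in = random_matrix(HIDDEN, EMB_DIM + len(POS_TAGS), rng)
    w_rec = random_matrix(HIDDEN, HIDDEN, rng)  # shared by both encoders for brevity
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(2 * HIDDEN)]
    probs = []
    for surf, syn in zip(surface_ngrams(len(tokens), n), syntactic_ngrams(heads, n)):
        enc = (rnn_encode([word_vector(i, tokens, tags, embeddings) for i in surf], w_in, w_rec)
               + rnn_encode([word_vector(i, tokens, tags, embeddings) for i in syn], w_in, w_rec))
        score = sum(w * e for w, e in zip(w_out, enc))
        probs.append(1.0 / (1.0 + math.exp(-score)))  # logistic output
    return probs


if __name__ == "__main__":
    tokens = ["the", "cats", "sleeps"]  # toy subject-verb agreement error
    tags = ["DET", "NOUN", "VERB"]
    heads = [2, 3, 0]                   # the -> cats -> sleeps (root)
    print(surface_ngrams(3, 3))         # [[None, 0, 1], [0, 1, 2], [1, 2, None]]
    print(syntactic_ngrams(heads, 3))   # [[0, 1, 2], [1, 2, None], [2, None, None]]
    print(bad_probabilities(tokens, tags, heads))
```

Padding slots become zero vectors so that every n-gram has the same length. Because the weights are random, the printed probabilities carry no meaning; in a real system they would be learned from word-level quality labels. The point of the syntactic n-gram is visible in the toy sentence: "cats" and "sleeps" fall in the same n-gram even when many words separate them on the surface.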

Keywords

Machine translation; Quality estimation; Neural networks

Type: Article
Information: Natural Language Engineering, Volume 26, Issue 1, January 2020, pp. 73–94

Copyright: © Cambridge University Press 2019


