How to evaluate machine translation: A review of automated and human metrics

Published online by Cambridge University Press: 11 September 2019

Eirini Chatzikoumi*
Affiliation:
Instituto de Literatura y Ciencias del Lenguaje, Pontificia Universidad Católica de Valparaíso, Av. El Bosque 1290, Viña del Mar, Chile
*Corresponding author.

Abstract

This article presents the most up-to-date and influential automated, semiautomated and human metrics used to evaluate the quality of machine translation (MT) output, and provides the necessary background for MT evaluation projects. Evaluation is widely acknowledged to be highly relevant for the improvement of MT. The article is divided into three parts: the first is dedicated to automated metrics; the second, to human metrics; and the last, to the challenges that neural machine translation (NMT) poses for evaluation. The first part covers reference translation–based metrics; confidence or quality estimation (QE) metrics, which are used as alternatives for quality assessment; and diagnostic evaluation based on linguistic checkpoints. Human evaluation metrics are classified according to whether or not human judges directly express a so-called subjective evaluation judgment, such as ‘good’ or ‘better than’; in error classification, for example, they do not. Methods of the first kind are based on directly expressed judgment (DEJ) and are therefore called ‘DEJ-based evaluation methods’, while those of the second kind are called ‘non-DEJ-based evaluation methods’. The DEJ-based evaluation section presents tasks such as fluency and adequacy annotation, ranking and direct assessment (DA), whereas the non-DEJ-based evaluation section details tasks such as error classification and postediting, with definitions and guidelines, so that the article can serve as a practical guide for evaluation projects. Following the detailed presentation of these metrics, the specificities of NMT are set forth along with suggestions for its evaluation, according to the latest studies. As human translators are the most adequate judges of the quality of a translation, emphasis is placed on human metrics seen from a translator-judge perspective, to provide useful methodological tools for interdisciplinary research groups that evaluate MT systems.

Type
Survey Paper
Copyright
© Cambridge University Press 2019
