Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task

Maria Tikhonova; Vladislav Mikhailov; Dina Pisarevskaya; Valentin Malykh; Tatiana Shavrina

doi:10.1017/S1351324922000225

Published online by Cambridge University Press: 09 June 2022

Maria Tikhonova*: Affiliation:
HSE University, Moscow, Russia
Vladislav Mikhailov: Affiliation:
HSE University, Moscow, Russia
Dina Pisarevskaya: Affiliation:
Independent Researcher, London, UK
Valentin Malykh: Affiliation:
Huawei Noah’s Ark Lab, Moscow, Russia
Tatiana Shavrina*: Affiliation:
HSE University, Moscow, Russia; AI Research Institute (AIRI), Moscow, Russia
*Corresponding author: E-mail: [email protected]; [email protected]

Abstract

Recent research has reported that standard fine-tuning approaches can be unstable because they are sensitive to various sources of randomness, including but not limited to weight initialization, training data order, and hardware. Such brittleness can lead to different evaluation results, prediction confidences, and inconsistent generalization across instances of the same model independently fine-tuned under the same experimental setup. Our paper explores this problem in natural language inference, a common task in benchmarking practices, and extends the ongoing research to the multilingual setting. We propose six novel textual entailment and broad-coverage diagnostic datasets for French, German, and Swedish. Our key findings are that the mBERT model demonstrates fine-tuning instability for categories that involve lexical semantics, logic, and predicate-argument structure, and that it struggles to learn monotonicity, negation, numeracy, and symmetry. We also observe that using extra training data only in English can enhance the generalization performance and fine-tuning stability, which we attribute to the cross-lingual transfer capabilities. However, the ratio of particular features in the additional training data might instead hurt the performance of individual model instances. We are publicly releasing the datasets, hoping to foster the diagnostic investigation of language models (LMs) in a cross-lingual scenario, particularly in terms of benchmarking, which might promote a more holistic understanding of multilingualism in LMs and cross-lingual knowledge transfer.

Keywords

Evaluation; Model Interpretation; Multilinguality; Natural Language Inference; Cross-lingual learning; Transfer learning

Type: Article
Information: Natural Language Engineering, Volume 29, Issue 3, May 2023, pp. 554–583

DOI: https://doi.org/10.1017/S1351324922000225
Copyright: © The Author(s), 2022. Published by Cambridge University Press

A correction has been issued for this article:

Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task – CORRIGENDUM
Maria Tikhonova, Vladislav Mikhailov, Dina Pisarevskaya, Valentin Malykh and Tatiana Shavrina
Natural Language Engineering, Volume 29, Issue 4
