
Evaluate Similarity of Requirements with Multilingual Natural Language Processing

Published online by Cambridge University Press:  26 May 2022

U. Bisang
Affiliation:
Fraunhofer IPK, Germany
J. Brünnhäußer*
Affiliation:
Fraunhofer IPK, Germany
P. Lünnemann
Affiliation:
Fraunhofer IPK, Germany
L. Kirsch
Affiliation:
CONTACT Software GmbH, Germany
K. Lindow
Affiliation:
Fraunhofer IPK, Germany

Abstract


Finding redundant or semantically similar requirements from previous projects is a very time-consuming task in engineering design, especially with multilingual data. Modern NLP makes it possible to automate such tasks. In this paper, we compared different multilingual embedding models to determine which is most suitable for finding similar requirements in English and German. The comparison was carried out on both in-domain data (requirement pairs) and out-of-domain data (general sentence pairs). The most suitable model was based on sentence embeddings learnt with knowledge distillation.
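For illustration, a minimal sketch of such a cross-lingual similarity search is given below. It assumes the sentence-transformers library and the knowledge-distilled multilingual model "paraphrase-multilingual-MiniLM-L12-v2"; the model name, library choice, and example requirements are illustrative assumptions, not the exact setup evaluated in the paper.

# Minimal sketch (assumed setup, not the paper's exact pipeline):
# score cross-lingual requirement pairs with a knowledge-distilled
# multilingual sentence-embedding model from sentence-transformers.
from sentence_transformers import SentenceTransformer, util

# Assumed model: a multilingual model trained with knowledge distillation.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

requirements_en = [
    "The system shall log every failed login attempt.",
    "The pump shall operate at ambient temperatures from -20 to 60 degrees Celsius.",
]
requirements_de = [
    "Das System muss jeden fehlgeschlagenen Anmeldeversuch protokollieren.",
    "Die Pumpe muss bei Umgebungstemperaturen von -20 bis 60 Grad Celsius arbeiten.",
]

# Encode both languages into the same embedding space.
emb_en = model.encode(requirements_en, convert_to_tensor=True)
emb_de = model.encode(requirements_de, convert_to_tensor=True)

# Cosine similarity matrix: entry [i, j] scores English requirement i
# against German requirement j; high values indicate likely duplicates.
scores = util.cos_sim(emb_en, emb_de)
for i, req_en in enumerate(requirements_en):
    j = int(scores[i].argmax())
    print(f"{req_en}  <->  {requirements_de[j]}  (cosine = {float(scores[i][j]):.2f})")

In practice, a threshold on the cosine score (or a top-k retrieval step) would decide which pairs are flagged as potentially redundant for manual review.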

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is unaltered and is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use or in order to create a derivative work.
Copyright
The Author(s), 2022.
