1. Introduction
Measuring the semantic relatedness of two pieces of text underlies a wide range of tasks in Natural Language Processing (NLP). At the sentence level, it involves three common NLP tasks: Paraphrase Identification (PI), which determines whether a sentence pair has equivalent semantics; Natural Language Inference (NLI), which infers the relation between a sentence pair; and Semantic Textual Similarity (STS), which scores the semantic similarity of a sentence pair. Many types of approaches have been published in previous works (Chandrasekaran and Mago Reference Chandrasekaran and Mago2020) for the PI task and its two variations (NLI and STS). In this article, we propose a new method to improve parameter efficiency for feature-based transfer, a transfer learning approach that customizes a task-specific architecture for PI. In this section, we first introduce the research goal that motivates our proposed method, based on the relevant technical background. In the second subsection, we describe the benefit of our technical novelty for a real-world scenario.
1.1 Technical background and research goal
When it comes to PI, there exists a wide range of traditional approaches that are relatively effective for the task: (1) lexical overlap features such as n-gram overlap (Wan et al. Reference Wan, Dras, Dale and Paris2006) and machine translation evaluation metrics (Madnani, Tetreault, and Chodorow Reference Madnani, Tetreault and Chodorow2012); (2) using external lexical knowledge like WordNet (Fellbaum Reference Fellbaum1998; Fernando and Stevenson Reference Fernando and Stevenson2008); (3) modeling divergence of dependency syntax between two sentences (Das and Smith Reference Das and Smith2009); (4) distributional models with matrix factorization (Guo and Diab Reference Guo and Diab2012; Ji and Eisenstein Reference Ji and Eisenstein2013). The traditional approaches mainly consist of unsupervised methods and feature engineering. Their demand for computational resources is low, while their task performance is moderate by today's standards. For example, with lexical overlap features, the only computational cost involving parameter update comes from classifier-tuning, but the task performance is comparatively less effective, as reflected by the results for the MRPC task (Dolan, Quirk, and Brockett Reference Dolan, Quirk and Brockett2004) in the ACL link.Footnote a In light of that fact, we come up with our research question—how can we preserve the low computational costs of traditional approaches while yielding better task performance? To pursue this research goal, we further investigate current neural network-based transfer learning approaches.
With the advent of various deep neural network models (Dong et al. Reference Dong, Wu, He, Yu and Wang2015; Vaswani et al. Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin2017; Howard and Ruder Reference Howard and Ruder2018; Yang et al. Reference Yang, Dai, Yang, Carbonell, Salakhutdinov, Le, Wallach, Larochelle, Beygelzimer, d’Alché-Buc, Fox and Garnett2019), transfer learning approaches have achieved state-of-the-art performance on many NLP downstream tasks including PI. There are two common transfer learning techniques in NLP: fine-tuning and feature-based transfer. In fine-tuning, the parameters of pre-trained language models like BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019) need to be fine-tuned. Feature-based transfer, on the other hand, does not require parameter updates of pre-trained embedding models like ELMo (Peters et al. Reference Peters, Neumann, Iyyer, Gardner, Clark, Lee and Zettlemoyer2018); however, the parameters of any customized task-specific architecture need to be updated. Both transfer learning techniques follow the same convention: the entire set of architectural parameters needs to be initialized and then updated for each individual task dataset (see 1 & 2 in Figure 1).
This successful performance, however, comes with a challenge: transfer learning approaches demand expensive computational resources (Strubell, Ganesh, and McCallum Reference Strubell, Ganesh and McCallum2019). As fine-tuning tends to achieve better performance than feature-based transfer, as shown in recent works (Howard and Ruder Reference Howard and Ruder2018; Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019), recently proposed resource-lean approaches are mainly based on the BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019) architecture, focusing on model compression such as repeating network layers (Lan et al. Reference Lan, Chen, Goodman, Gimpel, Sharma and Soricut2019) and knowledge distillation (Sanh et al. Reference Sanh, Debut, Chaumond and Wolf2019). These compressed models can reduce the total number of architectural parameters, but they cannot go beyond the aforementioned convention, and thus whether the parameters are used efficiently remains uninvestigated.
Still, the adapter module proposed by Houlsby et al. (Reference Houlsby, Giurgiu, Jastrzebski, Morrone, De Laroussilhe, Gesmundo, Attariyan, Gelly, Chaudhuri and Salakhutdinov2019a) follows the research direction of parameter efficiency—only a comparatively small number of task-specific parameters are initialized and then updated for every single task (see 3 in Figure 1). Houlsby et al. (Reference Houlsby, Giurgiu, Jastrzebski, Morrone, De Laroussilhe, Gesmundo, Attariyan, Gelly, Chaudhuri and Salakhutdinov2019a) applied the adapter module to BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019) to test its effectiveness, and the adapter-BERT attained within 0.4% of the performance of full fine-tuning while adding only 3.6% parameters per task. Such a performance level shows that there is no need to initialize and then update the entire set of architectural parameters for each task; instead, the focus should be placed on the efficient usage of parameters. Most recently, this research direction has tended towards extracting an optimal subset of architectural parameters (de Wynter and Perry Reference de Wynter and Perry2020) and has also been extended to autoregressive models, for example P-tuning (Liu et al. Reference Liu, Zheng, Du, Ding, Qian, Yang and Tang2021) for GPT (Radford et al. Reference Radford, Wu, Child, Luan, Amodei and Sutskever2019).
The efficient usage of architectural parameters also resonates with a current trendFootnote b in the NLP community, which encourages researchers to empirically justify model complexity beyond benchmark performance. However, to the best of our knowledge as of writing, direct research attempts regarding parameter efficiency are rarely made for feature-based transfer. In the last two years, only one indirect attempt was made: the PAR (Paraphrase-Aware Retrofitting) method proposed by Shi et al. (Reference Shi, Chen, Zhou and Chang2019), which aims to address the unstable semantics of contextualized word embeddings of shared words when their context is paraphrased. Besides, although any customized task-specific architecture can be trained on a sequence of task datasets through continual learning (Thrun Reference Thrun1998), the parameters of the re-trained network are inclined to forget how to perform previous tasks—catastrophic forgetting (McCloskey and Cohen Reference McCloskey and Cohen1989; French Reference French1999).
While related works are centered around language model-based fine-tuning, we find that one advantage of feature-based transfer tends to be neglected and consequently left unexplored. As various task-specific architectures are customized for a particular task like PI, we discover that it is viable to fix the architectural parameters trained on a single task dataset and then transfer the fixed parameters to other task datasets. With this discovery, while yielding better task performance, feature-based transfer can enjoy the same low computational costs as traditional approaches: the initialization of pre-trained architectural parameters is required only once, and there is no need for further parameter updates for different tasks (see 4 in Figure 1); the only computational costs are consumed by classifier-tuning. This technical scenario cannot be realized by language model-based fine-tuning. For example, task-specific parameters of the adapter module (Houlsby et al. Reference Houlsby, Giurgiu, Jastrzebski, Morrone, de Laroussilhe, Gesmundo, Attariyan and Gelly2019b) tuned on task A cannot be directly used for task B without any modification. In this article, however, we show that it is feasible for feature-based transfer.
For readability in the rest of this article, we use the acronym PEFBAT (Parameter-Efficient Feature-BAsed Transfer) to denote our proposed method. The technical novelty of PEFBAT is that the fixed parameters of the pre-trained task-specific architecture can be shared by multiple classifiers with small additional parameters. This mechanism can address our research goal. For each PI, NLI, or STS task, the computational cost left involving parameter update is only generated from classifier-tuning: the features output from the architecture combined with lexical overlap features are fed into a single classifier for tuning. Such technical novelty is also conducive to a real-world scenario, which is another contribution of PEFBAT described in the next subsection.
1.2 Practical use in real-world scenario
Similar to the adapter module (Houlsby et al. Reference Houlsby, Giurgiu, Jastrzebski, Morrone, de Laroussilhe, Gesmundo, Attariyan and Gelly2019b), PEFBAT can be applied to continual learning, but they differ in two respects. Since the former is language model-based (e.g., adapter-BERT when it is applied to the BERT model), it can handle more than sentence-pair tasks. PEFBAT, on the other hand, can address only three types of sentence-pair tasks, but it operates in a more power-efficient manner: for each task arriving from customers, only a single classifier needs to be instantiated (see Figure 2). The power cost is measured by the time required for parameter update of each task, which directly reflects the demand for computational resources. For example, in the case of a Multi-Layer Perceptron (MLP)Footnote c with batch size 32 and 100 epochs, for small datasets like MRPC (Dolan et al. Reference Dolan, Quirk and Brockett2004) (4k training data), training an MLP takes approximately 1 minute. For large datasets like QQP (Iyer, Dandekar, and Csernai Reference Iyer, Dandekar and Csernai2017) (363k training data), training an MLP can be finished in 37 minutes. More importantly, task performance is not compromised by catastrophic forgetting, as each classifier is tailored individually for each task.
Furthermore, the additional parameters per task contained in each classifier are few (0.479M trainable parameters, explained in Section 4), which can be considered parameter-efficient model expansion: the combined size of 230 classifiers for 230 customer tasks amounts to that of a single BERT-base model. Although using only engineered features also consumes few computational and memory resources, PEFBAT yields better task performance, and thus it is comparatively a better fit for power-efficient continual learning.
2. Related works
Many approaches for PI have been proposed in previous works (Chandrasekaran and Mago Reference Chandrasekaran and Mago2020). In this section, we discuss the traditional and transfer learning approaches that are related to our work. The strengths of the traditional approaches are integrated into PEFBAT. The transfer learning approaches, although they adopt different strategies, are related to parameter efficiency. Besides, we extend our discussion to two topics: (1) task-specific DNNs, which are the conventional strategy adopted for feature-based transfer, and (2) the solutions to catastrophic forgetting in continual learning, because PEFBAT can be considered a variant of parameter-efficient model expansion.
2.1 Traditional approaches
In PEFBAT, the lexical overlap features (Wan et al. Reference Wan, Dras, Dale and Paris2006) are combined with the transferred features (the features output from our pre-trained task-specific architecture) as input to each individual classifier. This technique is not unusual for feature-based transfer; for example, Yin and Schütze (Reference Yin and Schütze2015a) combined the machine translation metrics (Madnani et al. Reference Madnani, Tetreault and Chodorow2012) with the flattened features output from their Bi-CNN architecture. In our case, the combination is based on a particular consideration—mutual complementation. As illustrated by the Jaccard distanceFootnote d distributions in Figures 3 and 4 for two paraphrase corpora,Footnote e lexical overlap features are noticeably cost-effective for task datasets like PAN (Madnani et al. Reference Madnani, Tetreault and Chodorow2012) (Figure 3); however, they become unviable when lexical overlaps are indistinguishable between paraphrase and non-paraphrase sentence pairs, as in task datasets like PAWS-wiki (Zhang, Baldridge, and He Reference Zhang, Baldridge and He2019) (Figure 4). The combination allows the lexical overlap features to contribute their merits while their demerits are compensated by the transferred features. We delve deeper into this mutual complementation in Section 4 based on our experimental results. From another perspective that is not directly related to our work, lexical overlap features are also beneficial to the paraphrase generation task. While the quality of generated paraphrases can be judged by state-of-the-art models like Sentence-BERT (Reimers and Gurevych Reference Reimers and Gurevych2019), as shown in a recent data augmentation work (Corbeil and Abdi Ghavidel Reference Corbeil and Abdi Ghavidel2021), some works still consider lexical overlap features as criteria: Nighojkar and Licato (Reference Nighojkar and Licato2021) use the BLEURT (Sellam, Das, and Parikh Reference Sellam, Das and Parikh2020) metric to calculate a reward for sentence pairs that are mutually implicative but lexically and syntactically disparate; Kadotani et al. (Reference Kadotani, Kajiwara, Arase and Onizuka2021) use edit distance to decide whether source and target sentences require drastic transformation, so that the training order for curriculum learning (Bengio et al. Reference Bengio, Louradour, Collobert and Weston2009) can be determined for better paraphrase generation performance; and Meng et al. (Reference Meng, Ao, He, Sun, Han, Wu, Fan and Li2021) use the Jaccard distance as one metric for filtering generated paraphrase candidates.
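For concreteness, the sketch below shows one way to compute the Jaccard distance of a sentence pair; whitespace tokenization and lowercasing are simplifying assumptions, and the preprocessing used to produce Figures 3 and 4 may differ from this minimal version.

```python
# Minimal sketch: Jaccard distance between the token sets of a sentence pair.
def jaccard_distance(sentence_a: str, sentence_b: str) -> float:
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)

# Paraphrase pairs tend to show a lower Jaccard distance (more shared tokens) in
# corpora like PAN, whereas in PAWS-wiki both classes overlap heavily.
print(jaccard_distance("He likes apples.", "He enjoys apples."))
```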
Ji and Eisenstein (Reference Ji and Eisenstein2013) utilize the TF-KLD weighting scheme to assign a weight to each feature (single word) in distributional models. The weighted distributional models can then generate discriminative sentence latent representations after matrix factorization, which are conducive to PI. With this special feature engineering process, the approach has achieved competitive performance on the MRPC (Dolan et al. Reference Dolan, Quirk and Brockett2004) task. Inspired by Ji and Eisenstein (Reference Ji and Eisenstein2013), the pre-trained task-specific architecture of PEFBAT is also designed to generate features containing discriminative semantics (details are presented in Section 3). Meanwhile, two of their limitations are not present in PEFBAT: (1) the TF-KLD weighting scheme relies on transductive learning (Gammerman, Vovk, and Vapnik Reference Gammerman, Vovk and Vapnik1998) to weight unseen words for optimal performance; (2) the scheme is strictly MRPC-dependent, and therefore not applicable to other PI task datasets or real-world scenarios. TF-KLD-KNN, proposed by Yin and Schütze (Reference Yin and Schütze2015b), addresses the first limitation but not the second.
2.2 Transfer learning approaches
Shi et al. (Reference Shi, Chen, Zhou and Chang2019) have discovered that in many cases, contextualized embeddings of shared words in paraphrased contexts change drastically. To minimize the difference of the shared words, they propose the PAR (Paraphrase-Aware Retrofitting) method: reshaping the input representations of contextualized models with an orthogonal transformation matrix. They apply the PAR method to ELMo (Peters et al. Reference Peters, Neumann, Iyyer, Gardner, Clark, Lee and Zettlemoyer2018) to test its effectiveness. Placing a task-specific architecture (here, an orthogonal transformation matrix) prior to the embedding model follows the line of retrofitting methods (Faruqui et al. Reference Faruqui, Dodge, Jauhar, Dyer, Hovy and Smith2015; Yu et al. Reference Yu, Cohen, Wallace, Bernstam and Johnson2016; Glavaš and Vulić Reference Glavaš and Vulić2018): incorporating semantic knowledge from external resources into word embeddings. The external resources that Shi et al. (Reference Shi, Chen, Zhou and Chang2019) used to train their task-specific architecture are the paraphrase sentence pairs from three corpora: PAN (Madnani et al. Reference Madnani, Tetreault and Chodorow2012), Sampled Quora (Iyer et al. Reference Iyer, Dandekar and Csernai2017), and MRPC (Dolan et al. Reference Dolan, Quirk and Brockett2004). The parameters of the trained architecture can then be used for PI, NLI, and STS tasks, which is parameter-efficient.
The adapter module (Houlsby et al. Reference Houlsby, Giurgiu, Jastrzebski, Morrone, De Laroussilhe, Gesmundo, Attariyan, Gelly, Chaudhuri and Salakhutdinov2019a) comprises the network layers with comparatively small size of parameters stitched into BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019) layers, aiming to address parameter efficiency for language models. For every task, the parameters of BERT layers remain fixed while the ones of adapter module are updated. Such design results in a model that is compact and extensible: a small number of additional parameters per task without forgetting how to perform previous tasks. The adapter module can be considered as a variation of layer transfer. The technique is commonly adopted in the field of computer vision. For example, in the task of image classification (Deng et al. Reference Deng, Dong, Socher, Li, Li and Fei-Fei2009), Yosinski et al. (Reference Yosinski, Clune, Bengio, Lipson, Ghahramani, Welling, Cortes, Lawrence and Weinberger2014) discovered that by transferring only the bottom layer of the network trained on source data, decent performance can be obtained for target data by only re-training the top layers, because the image data tend to share similar patterns at the lower layers of the network. The adapter module exhibits similar property as it automatically prioritizes higher layers (Houlsby et al. Reference Houlsby, Giurgiu, Jastrzebski, Morrone, De Laroussilhe, Gesmundo, Attariyan, Gelly, Chaudhuri and Salakhutdinov2019a), which matches the popular strategy in fine-tuning (Howard and Ruder Reference Howard and Ruder2018). PEFBAT also benefits from the strategy: pre-trained embeddings plus task-specific architecture (fixed parameters at lower layers), and classifier (parameters at higher layers updated for different tasks). In Section 4, adapter-BERT is considered as our upper bound.
2.3 Task-specific DNNs
When it comes to feature-based transfer, it is a vital step to customize a task-specific architecture. The input of the architecture is pre-trained word embeddings (Mikolov et al. Reference Mikolov, Sutskever, Chen, Corrado and Dean2013; Pennington, Socher, and Manning Reference Pennington, Socher and Manning2014; Mikolov et al. Reference Mikolov, Grave, Bojanowski, Puhrsch and Joulin2018; Peters et al. Reference Peters, Neumann, Iyyer, Gardner, Clark, Lee and Zettlemoyer2018), and output is typically flattened features connected to a classifier. The conventional strategy adopted for the task-specific architecture is task-specific DNNs. One successful embodiment is Siamese neural network (Bromley et al. Reference Bromley, Bentz, Bottou, Guyon, LeCun, Moore, Säckinger and Shah1993), which shows the key insight of extracting interaction features from input word embeddings of two sentences at multiple levels of granularity (unigram, short n-gram, long n-gram, and sentence levels). A comparatively effective implementation of Siamese architecture is Bi-CNN (Yin and Schütze Reference Yin and Schütze2015a; He, Gimpel, and Lin Reference He, Gimpel and Lin2015; Yin et al. Reference Yin, Schütze, Xiang and Zhou2016): using two sub-networks (double convolutional layers) to process word embeddings of two sentences. In the Bi-CNN architecture, each sentence of a sentence pair at the beginning is represented as a matrix, in which every column vector corresponds to a word embedding. Then the matrix representations of two sentences are processed by convolution filters with n-gram width and multiple types of pooling at different network layers, during which, interaction features representing multi-granular semantics are extracted from the gradually processed matrix representations. The final step is to connect the flattened multi-granular interaction features to a classifier for supervised learning.
The Siamese architecture implemented by Bi-CNN is not the only approach to extracting multi-granular interaction features. For example, the RAE model proposed by Socher et al. (Reference Socher, Huang, Pennin, Manning and Ng2011) first uses a recursive neural network (also known as TreeRNN) to pre-train embeddings at the word, phrase, and sentence levels. Those pre-trained embedding representations are denoted as multi-granular nodes. Then, for the nodes of a sentence pair, an $n_1 \times n_2$ similarity matrix is computed as interaction features, where $n_1$ and $n_2$ are the numbers of nodes of the two sentences, respectively, and each similarity is the Euclidean distance between two nodes. Finally, the similarity matrix is fed into a dynamic pooling layer to fix its size for supervised learning, as each sentence pair has a different number of nodes. The RAE model is comparatively less effective than Bi-CNN in terms of performance on the MRPC task (Dolan et al. 2004). As explained in Yin and Schütze (Reference Yin and Schütze2015a)’s work, this is due to the unavailability of highly accurate parsers for tree structures.
PEFBAT can yield competitive performance level compared to Bi-CNN (demonstrated in Section 4), although its mechanism for PI (described in Section 3) is different. From technical perspective, unlike Bi-CNN, the task-specific architecture of PEFBAT is pre-trained, and therefore there is no need of further parameter update for each task, which is parameter-efficient and cost-friendly.
2.4 Continual learning
Continual learning (Thrun Reference Thrun1998), also known as lifelong learning, never-ending learning, or incremental learning, is a machine learning technique of training tasks sequentially using a single instance of a model. The task range of continual learning nowadays is typically the same task in different domains; for example, the 20 QA tasks in the bAbI corpus (Weston et al. Reference Weston, Bordes, Chopra, Rush, van Merriënboer, Joulin and Mikolov2015), permuted handwritten digit recognition (van de Ven and Tolias Reference van de Ven and Tolias2019), text classification tasks with different classes (de Masson d’Autume et al. Reference de Masson d’Autume, Ruder, Kong, Yogatama, Wallach, Larochelle, Beygelzimer, d’Alché-Buc, Fox and Garnett2019), and so on. A recent research attempt (Sun, Ho, and Lee Reference Sun, Ho and Lee2020) has managed to handle 5 disparate NLP tasks by following decaNLP (McCann et al. Reference McCann, Keskar, Xiong and Socher2018) in treating all tasks as a QA format. The biggest problem for continual learning is catastrophic forgetting (McCloskey and Cohen Reference McCloskey and Cohen1989; French Reference French1999)—the network trained on a new task is inclined to forget how to perform previous tasks. The common solutions to catastrophic forgetting can be grouped into the three categories listed below.
-
(1) regularization-based methods: the paradigm of this method is EWC (Elastic Weight Consolidation) (Kirkpatrick et al. Reference Kirkpatrick, Pascanu, Rabinowitz, Veness, Desjardins, Rusu, Milan, Quan, Ramalho, Grabska-Barwinska, Hassabis, Clopath, Kumaran and Hadsell2017), the key mechanism of which is to add constraints to the parameters that are sensitive to previous tasks. As a result, those sensitive parameters are not modified to a large extent when a new task is being trained.
-
(2) parameter-efficient model expansion: from earlier works like Net2Net (Chen, Goodfellow, and Shlens Reference Chen, Goodfellow and Shlens2016) and Progressive Neural Networks (Rusu et al. Reference Rusu, Rabinowitz, Desjardins, Soyer, Kirkpatrick, Kavukcuoglu, Pascanu and Hadsell2016) to the adapter module (Houlsby et al. Reference Houlsby, Giurgiu, Jastrzebski, Morrone, de Laroussilhe, Gesmundo, Attariyan and Gelly2019b) nowadays, the major concept of this solution is to expand a model by adding additional parameters per task. It can also work in the reverse direction—initialize a large network at the beginning and then distribute a portion of parameters to each task (Hung et al. Reference Hung, Tu, Wu, Chen, Chan, Chen, Wallach, Larochelle, Beygelzimer, d’Alché-Buc, Fox and Garnett2019).
-
(3) memory replay: the main idea of this solution is to keep a small number of old data of previous tasks, and the old data can be either real data (de Masson d’Autume et al. Reference de Masson d’Autume, Ruder, Kong, Yogatama, Wallach, Larochelle, Beygelzimer, d’Alché-Buc, Fox and Garnett2019) or generated pseudo-data (Sun et al. Reference Sun, Ho and Lee2020). With this benefit, a new task can be trained together with the old data in a multi-task learning manner (Caruana Reference Caruana1998), which is considered as the upper bound of continual learning. It is also shown in GEM (Gradient Episodic Memory) (Lopez-Paz and Ranzato Reference Lopez-Paz, Ranzato, Guyon, Luxburg, Bengio, Wallach, Fergus, Vishwanathan and Garnett2017) that the old real data can be used to prevent gradient update from being biased towards a new task during optimization.
Curriculum learning (Bengio et al. Reference Bengio, Louradour, Collobert and Weston2009), although not as common as the categories above, is also a solution to catastrophic forgetting. It seeks to find out a proper learning order of tasks. A well-known study of curriculum learning is taskonomy (task + taxonomy) proposed by Zamir et al. (Reference Zamir, Sax, Shen, Guibas, Malik and Savarese2018), which explores the learning order of various image-related tasks; for example detecting 3D edges and normal vectors of images first can help learn point matching and reshading effectively.
PEFBAT can be considered a variant of the second solution. The limitation of the second solution is that when the number of tasks grows larger and larger, it will eventually use up memory resources. However, as we discussed in Section 1.2, PEFBAT can alleviate this problem because the number of parameters contained in each classifier is small—the size of 230 MLP classifiers for 230 tasks is equal to that of a single BERT-base model. If we use an SVM instead of an MLP as the classifier for small tasks like MRPC (Dolan et al. 2004) without sacrificing task performance, the total size becomes even smaller. Hence, PEFBAT is not only a solution to catastrophic forgetting but also competent in power-efficient continual learning.
3. Proposed method
Our technical motivation towards PEFBAT is introduced in Section 1.1. In this section, we move into the implementation details and mechanism of PEFBAT. We first describe the modeling of initial sentence-pair representations and the objective of our pre-training. Then how to pre-train the task-specific architecture of PEFBAT using the representations is described in the second subsection. The third subsection explains the discriminative features that our pre-trained task-specific architecture can output, which are presented together with other features in the last subsection.
3.1 Initial sentence-pair representations and objective of pre-training
In this work, the crucial semantics conveyed in a single sentence are represented by the three components (see the upper part of Figure 5). The entity and action components basically describe “what is the state or activity that somebody or something is on.” The modification component conveys the semantics that modify the entity and action components. For example, “in a particular way” can be interpreted as either adverbs that modify verbs or determiners and adjectives that modify nouns.
The concatenation of the three components plays an important role in measuring the semantic similarity of a sentence pair (see the lower part of Figure 5). In addition, to facilitate PI, the reward component (details explained in Sections 3.2.1 and 3.2.2) containing (non)paraphrase-like characteristics is combined with the other three components to form the initial sentence representations of a sentence pair. The objective of our pre-training, therefore, is to have the initial representations generate discriminative latent representations. The latter contain discriminative features, which are conducive to PI.
3.2 Latent space pre-training
To generate discriminative sentence latent representations, we refer to Ji and Eisenstein (Reference Ji and Eisenstein2013)’s work by taking advantage of labeled data, and our concrete strategy in this article is the dual weighting scheme shown in Figure 6. The paraphrase weighting scheme is used to weight five components of a sentence belonging to paraphrase sentence pairs. Then the concatenation of five weighted components is mapped to its latent representation for pre-training the paraphrase latent space. The same step is applied to pre-training the non-paraphrase latent space, where the components of non-paraphrase sentences are weighted by the non-paraphrase weighting scheme. Besides, two reward components are used to represent (non)paraphrase-like characteristics, which is the best setting for pre-training. Later in Section 4, an ablation study is performed for validation.
The two pre-trained latent spaces constitute our task-specific architecture, whose fixed parameters can be used for multiple PI, NLI, and STS tasks. During inference, as shown in Figure 7, for each sentence of any sentence pair, the two types of weighted components based on the dual weighting scheme are first concatenated, respectively, and then two latent representations are obtained by projecting the two concatenations through the two pre-trained latent spaces, respectively: the paraphrase and non-paraphrase latent representations. They contain discriminative features, which are described at length in Section 3.3.
Before stepping into further implementation details, we introduce the embeddings, toolkits, and corpus resource that are employed for pre-training. As we consider the application to low-resource natural languages with limited labeled and unlabeled data (Hedderich et al. Reference Hedderich, Lange, Adel, Strötgen and Klakow2021) as our future work (discussed in the last section), the choice of the resources is based on the consideration of “lightness”: requirement of small training data size and easy availability.
Corpus Resource: The corpus that we use is only the training dataset of the Microsoft Research Paraphrase Corpus (MRPC) (Dolan et al. Reference Dolan, Quirk and Brockett2004), consisting of 4076 sentence pairs, of which 2753 pairs are labeled as paraphrase; it is so far the smallest English paraphrase corpus.
Pre-trained Embeddings: We choose fastText (Mikolov et al. Reference Mikolov, Grave, Bojanowski, Puhrsch and Joulin2018) trained with subword information on CommonCrawl as our pre-trained embeddings because it is not based on a deep learning architecture, which makes it applicable to low-resource natural languages (Mohiuddin and Joty Reference Mohiuddin and Joty2020). Moreover, it achieves better performance than GloVe (Pennington et al. Reference Pennington, Socher and Manning2014) trained on CommonCrawl in the SentEval framework (Conneau and Kiela Reference Conneau and Kiela2018).
Toolkits: Considering easy availability, we utilize part-of-speech tags (POS-tags) provided by NLTKFootnote f for categorizing entity, action, and modification components. The scikit-learn toolFootnote g is used for distributional models and matrix factorization.
3.2.1 Five components
Categorizing words as the elements of entity, action, and modification components is based on three specific sets of POS-tags shown below.
-
Entity set: singular noun, plural noun, singular proper noun, plural proper noun, personal pronoun
-
Action set: base form verb, past tense verb, present participle verb, past participle verb, present verb, third person present verb
-
Modification set: determiner, predeterminer, adjective, comparative adjective, superlative adjective, possessive pronoun, possessive ending, existential there, modal, adverb, comparative adverb, superlative adverb, particle, to
Suppose the sentence “He likes apples.” has three word embeddings (0.1, 0.1, 0.1) for ‘he’, (0.2, 0.2, 0.2) for ‘likes’, and (0.3, 0.3, 0.3) for ‘apples’, then the entity component is [(0.1, 0.1, 0.1) + (0.3, 0.3, 0.3)]/ 2 = (0.2, 0.2, 0.2); the action component is (0.2, 0.2, 0.2)/ 1 = (0.2, 0.2, 0.2); the modification component is (0, 0, 0). If the word embedding of a word does not exist in the embedding space, our strategy is to skip it, because the sentences in MRPC corpus are derived from old news resources, rarely containing newly coined words like “infodemic.”
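The sketch below illustrates this component extraction, assuming NLTK POS-tagging and a dictionary-like lookup for fastText word vectors; the tag groups are a simplified approximation of the sets listed above, and the exact preprocessing in our implementation may differ.

```python
import numpy as np
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are downloaded

# Penn Treebank tag groups roughly matching the entity / action / modification sets above.
ENTITY_TAGS = {"NN", "NNS", "NNP", "NNPS", "PRP"}
ACTION_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
MODIFICATION_TAGS = {"DT", "PDT", "JJ", "JJR", "JJS", "PRP$", "POS", "EX",
                     "MD", "RB", "RBR", "RBS", "RP", "TO"}

def components(sentence, embeddings, dim=300):
    """Average word vectors per component; a component with no in-vocabulary
    words stays a zero vector, and out-of-vocabulary words are skipped."""
    sums = {"entity": np.zeros(dim), "action": np.zeros(dim), "modification": np.zeros(dim)}
    counts = {"entity": 0, "action": 0, "modification": 0}
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        vec = embeddings.get(word.lower())
        if vec is None:
            continue
        if tag in ENTITY_TAGS:
            key = "entity"
        elif tag in ACTION_TAGS:
            key = "action"
        elif tag in MODIFICATION_TAGS:
            key = "modification"
        else:
            continue
        sums[key] += vec
        counts[key] += 1
    return {k: sums[k] / counts[k] if counts[k] else sums[k] for k in sums}

# Toy embeddings reproducing the worked example above.
toy = {"he": np.full(300, 0.1), "likes": np.full(300, 0.2), "apples": np.full(300, 0.3)}
comps = components("He likes apples.", toy)  # entity 0.2, action 0.2, modification 0.0
```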
Each reward component is initialized as a vector with the same length as the fastText word embeddings (300), with all dimension values initially set to 1.0. The two weighted reward components are meant to provide strong or weak rewards for the concatenation of components. With their functionality, sentence pairs that have (non)paraphrase-like characteristics tend to obtain similar latent representations in the (non)paraphrase latent spaces.
3.2.2 Dual weighting scheme
Our dual weighting scheme is summarized in Table 1, which is utilized to determine whether a component should be strongly or weakly weighted. In the scheme, our weighting criteria are based on two elements: “label” (universal to all task datasets) and “occurrence difference” (manually adjustable for different task datasets).
For both reward components, our weighting criterion is “occurrence difference.” We utilize the characteristics-based thresholds (see Figures 8, 9, and 10) to determine strong or weak rewards. The main threshold is the Jaccard distance (abbreviated as j-dist in Figure 10) (Jaccard Reference Jaccard1912). We performed PI on MRPC (Dolan et al. 2004) using the Jaccard distance alone and found that at 0.6, accuracy and F1 score increase significantly. Thus, the threshold of 0.6 is utilized to determine one reward component. During inference, when the threshold of j-dist is adjusted for other task datasets, using the weighting criterion of “occurrence difference” is empirically proven sufficient in Section 4, which is time-saving. Only for pre-training our latent spaces do we need to ensure that every default setting is well chosen. The sentence length (abbreviated as sent-len in Figure 8) and the absolute difference of sentence lengths (abbreviated as sent-len-diff in Figure 9) are the main factors that can affect the Jaccard distance, so we combine their thresholds to determine the other reward component, which assists the one determined by j-dist. To briefly sum up, the three thresholds are considered as a whole package to determine whether the two reward components should be weighted strongly or weakly, which is the way to represent whether a sentence pair exhibits (non)paraphrase-like characteristics.
For the three word embedding components (entity, action, and modification), our weighting criterion is “label.” As adverbs like negation (not) convey discriminative semantics, we assign more weight to the modification component if a sentence pair is labeled as non-paraphrase. On the other hand, the entity and action components basically describe “what is the state or activity that somebody or something is on” (mentioned earlier in this subsection), which is the principal requirement for semantic similarity of sentences, so we assign more weight to the entity and action components if a sentence pair is labeled as paraphrase. The weighting criterion of “label” is universal to all task datasets. The digit-count threshold (see Figure 11) reflects our assumption that when too many digits occur in a sentence pair, the influence of the semantics represented in the entity component should be lessened. To verify this assumption, we also pre-train another two latent spaces without the threshold. In Section 4, we conduct the experiments using the latent spaces pre-trained “with digit-count” or “without digit-count.” Below we provide an example from the MRPC training dataset, where the sentence pair is labeled as non-paraphrase and the digit-count is 6 (6 digits in total across both sentences); a schematic sketch of the dual weighting scheme follows the example.
-
(1) Yucaipa owned Dominick’s before selling the chain to Safeway in 1998 for $2.5 billion.
-
(2) Yucaipa bought Dominick’s in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.
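The sketch below schematically applies the dual weighting scheme to one sentence. The j-dist threshold of 0.6 and the weight values come from Table 1 and Section 3.2.4; the remaining threshold values and the exact mapping from threshold outcomes to strong or weak rewards are placeholders rather than our actual defaults.

```python
import numpy as np

STRONG_REWARD, WEAK_REWARD = 1.0, 0.2   # reward components (Section 3.2.4)
STRONG_WEIGHT, WEAK_WEIGHT = 1.5, 0.5   # "weight pair" for entity / action / modification
J_DIST_THRESHOLD = 0.6                  # determined on the MRPC training data
SENT_LEN_THRESHOLD = 30                 # placeholder value
SENT_LEN_DIFF_THRESHOLD = 5             # placeholder value
DIGIT_COUNT_THRESHOLD = 4               # placeholder value

def weighted_concatenation(comps, pair_stats, scheme, use_digit_count=True):
    """Build the 1500-dim weighted input of one sentence: three word-embedding
    components weighted by the (non-)paraphrase scheme, plus two 300-dim reward
    components weighted by the occurrence-difference thresholds."""
    if scheme == "paraphrase":           # "label" criterion (Table 1)
        w_entity, w_action, w_mod = STRONG_WEIGHT, STRONG_WEIGHT, WEAK_WEIGHT
    else:                                # non-paraphrase weighting scheme
        w_entity, w_action, w_mod = WEAK_WEIGHT, WEAK_WEIGHT, STRONG_WEIGHT
    if use_digit_count and pair_stats["digit_count"] > DIGIT_COUNT_THRESHOLD:
        w_entity = WEAK_WEIGHT           # lessen entity semantics for digit-heavy pairs

    # "occurrence difference" criterion; the strong/weak assignment shown here
    # is illustrative, and Table 1 specifies the actual assignment.
    reward_jdist = STRONG_REWARD if pair_stats["j_dist"] <= J_DIST_THRESHOLD else WEAK_REWARD
    reward_sent = (STRONG_REWARD
                   if pair_stats["sent_len"] <= SENT_LEN_THRESHOLD
                   and pair_stats["sent_len_diff"] <= SENT_LEN_DIFF_THRESHOLD
                   else WEAK_REWARD)

    ones = np.ones(300)
    return np.concatenate([w_entity * comps["entity"],
                           w_action * comps["action"],
                           w_mod * comps["modification"],
                           reward_jdist * ones,
                           reward_sent * ones])
```

During pre-training, `scheme` follows the gold label of the pair; during inference, each sentence is weighted under both schemes before projection into the corresponding latent space (Figure 7).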
3.2.3 Target latent representations
We use distributional models with matrix factorization to generate latent representations for all sentences in the training dataset of MRPC (Dolan et al. Reference Dolan, Quirk and Brockett2004). For factorization, we follow Guo and Diab (Reference Guo and Diab2012)’s work by choosing Singular Value Decomposition (Deerwester et al. Reference Deerwester, Dumais, Furnas, Landauer and Harshman1990) and a latent dimensionality of 100. Moreover, we normalize the factorized matrix, as the latent representations form the target layer of our architecture and the activation function is tanh. As a result, all values in the factorized matrix fall within the open interval (−1, 1).
Furthermore, we refine the latent representations for paraphrase pairs. Suppose the latent representations for sentences $S_{1}$ and $S_{2}$ of a sentence pair are (0.9, 0.8, 0) and (0.7, 0.8, 0.6), respectively; then the pair shares a latent representation obtained by element-wise averaging of the two vectors; in this case, the shared latent representation is (0.8, 0.8, 0.3). The rationale for sharing is that paraphrase pairs should have similar or even identical dimension values. By doing so, the two initial sentence representations of any paraphrase sentence pair are mapped to the same latent representation during pre-training, and therefore during inference, paraphrase pairs tend to obtain similar latent representations after projection from the paraphrase latent space.
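A minimal sketch of this step with scikit-learn is shown below, assuming a plain term-count distributional model and a global max-absolute normalization; the actual preprocessing and normalization may differ in detail.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

def target_latents(sentences, pair_index, labels, dim=100):
    """sentences: all MRPC training sentences; pair_index: (i, j) sentence
    indices per pair; labels: 'paraphrase' / 'non-paraphrase' per pair."""
    counts = CountVectorizer().fit_transform(sentences)             # distributional model
    latents = TruncatedSVD(n_components=dim).fit_transform(counts)  # SVD factorization
    # One possible normalization keeping all values inside the open interval
    # (-1, 1), matching the tanh output layer of the latent-space mapping.
    latents = latents / (np.abs(latents).max() + 1e-8)
    # Paraphrase pairs share one latent representation: the element-wise mean.
    for (i, j), label in zip(pair_index, labels):
        if label == "paraphrase":
            shared = (latents[i] + latents[j]) / 2.0
            latents[i], latents[j] = shared, shared
    return latents
```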
3.2.4 Hyperparameters of pre-training and method of determining weights
For each sentence in the training dataset of MRPC (Dolan et al. 2004), the concatenation of its five weighted components is mapped to its latent representation. As mentioned earlier in this subsection and shown in Figure 6, all (non)paraphrase sentences are used to pre-train the (non)paraphrase latent space based on the (non)paraphrase weighting scheme.
Since the size of the training dataset is small, back-propagation is performed for each mapping, following vanilla Stochastic Gradient Descent (vanilla SGD). The loss is accumulated over all samples of each training epoch, and the accumulation is averaged at the end of each epoch. The hyperparameters of our pre-training are input vector length (1500), target vector length (100), activation function (tanh), loss function (LSE), epochs (500), learning rate (5e-4), and optimization (vanilla SGD).
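The sketch below shows this pre-training loop in NumPy, assuming the mapping from the 1500-dim input to the 100-dim target is a single affine layer followed by tanh (consistent with the hyperparameters listed above); the released implementation may structure the network differently.

```python
import numpy as np

LR, EPOCHS = 5e-4, 500
rng = np.random.default_rng()
W = rng.normal(scale=0.01, size=(100, 1500))  # latent-space parameters
b = np.zeros(100)

def pretrain(inputs, targets):
    """inputs: (N, 1500) weighted concatenations; targets: (N, 100) latent reps."""
    global W, b
    for epoch in range(EPOCHS):
        epoch_loss = 0.0
        for x, t in zip(inputs, targets):        # vanilla SGD, one sample at a time
            y = np.tanh(W @ x + b)
            err = y - t
            epoch_loss += float(err @ err)       # least-squares error (LSE)
            grad = 2.0 * err * (1.0 - y ** 2)    # back-propagate through tanh
            W -= LR * np.outer(grad, x)
            b -= LR * grad
        # accumulate the loss over the epoch and average it at the end
        print(f"epoch {epoch + 1}: mean loss {epoch_loss / len(inputs):.4f}")
```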
Before formal pre-training, we first tune the weights with small training epochs ( $\leq$ 50). The determined weights are illustrated in Table 1. For both reward components, the strong and weak weights are 1 and 0.2, respectively; 1.5 and 0.5 (we call it “weight pair”) are set for the three word embedding components. How they are determined is described below.
-
(1) First, we hypothesize that 1 is the optimal strong weight for both reward components, because if it is too large it makes the dimensions of the other input components less significant, while if it is too small it cannot be considered a strong weight.
-
(2) Then we keep the “weight pair” (1.5 & 0.5) as it is and tune the weak weight between 0.1 and 0.2 for the best optimization of pre-training. After only 3 epochs of pre-training, we observe that the error loss decreases significantly in the case of 0.2 and hardly drops when 0.1 is used. Therefore, we choose 0.2 as the weak weight for both reward components.
-
(3) After deciding the weak weight, we start to tune the “weight pair.” We use only 50 epochs to pre-train our latent spaces and experiment with different pairs from (1.1, 0.9) to (1.9, 0.1), pivoting around 1. We found that with the pair (1.5, 0.5), better discriminative similarity (explained in the next subsection) is achieved.
After the weights are determined, we set out to formally pre-train the two latent spaces with the full 500 training epochs (4 GPU hours). We save the latent spaces every 50 epochs during pre-training and choose the optimally pre-trained ones—those that achieve the optimal PI performance on the test dataset of MRPC—for the different PI, NLI, and STS tasks. The experimental results in Section 4 show that our pre-trained latent spaces are neither overfitting nor limited to MRPC. As for reproducibility, we conduct pre-training several times without setting a particular seed, and the experimental results do not fluctuate drastically (details are provided in the next subsection).
3.3 Discriminative similarity
As mentioned earlier in Section 3.1, the objective of our pre-training is to have the two initial sentence-pair representations generate latent representations containing discriminative features. This effect is reflected by discriminative similarity, which is explained in this subsection.
As shown earlier in Figure 7, each sentence of any sentence pairs can obtain two types of latent representations based on two pre-trained latent spaces. For conciseness, paraphrase latent representation obtained for a sentence is abbreviated as Lp(S), and Lnp(S) is an abbreviation for non-paraphrase latent representation. Based on the optimally pre-trained latent spaces, we calculate the cosine similarities for Lp( $S_{1}$ ) & Lp( $S_{2}$ ) and Lnp( $S_{1}$ ) & Lnp( $S_{2}$ ), using the sentence pairs in the training dataset of MRPC. The results are shown in Tables 2 and 3.
Compared to the mean cosine similarity in the paraphrase latent space, non-paraphrase sentence pairs tend to have higher cosine similarity ( $\approx$ 0.73) in the non-paraphrase latent space, while paraphrase sentence pairs tend to remain the same or decrease marginally ( $\approx$ 0.81). We call this phenomenon “discriminative similarity.” The “with digit-count” assumption makes the discriminative similarity comparatively more noticeable, as the mean cosine similarity of non-paraphrase sentence pairs is 0.681 in the paraphrase latent space, which is lower than 0.711 when “without digit-count” is the case.
To exploit the advantage of discriminative similarity for PI, the cosine similarities for Lp( $S_{1}$ ) & Lp( $S_{2}$ ) and Lnp( $S_{1}$ ) & Lnp( $S_{2}$ ) are incorporated into our feature set as the primary features. All the features are collectively presented in the next subsection.
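The sketch below shows how these two primary features could be computed at inference time; `project_p` and `project_np` stand for the fixed paraphrase and non-paraphrase latent spaces (e.g. tanh(Wx + b) with the pre-trained parameters), and the argument names are illustrative rather than taken from our implementation.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def discriminative_similarity(x1_p, x1_np, x2_p, x2_np, project_p, project_np):
    """x*_p / x*_np: paraphrase- and non-paraphrase-weighted concatenations of
    the two sentences; returns cos(Lp(S1), Lp(S2)) and cos(Lnp(S1), Lnp(S2))."""
    lp_s1, lp_s2 = project_p(x1_p), project_p(x2_p)        # Lp(S1), Lp(S2)
    lnp_s1, lnp_s2 = project_np(x1_np), project_np(x2_np)  # Lnp(S1), Lnp(S2)
    return cosine(lp_s1, lp_s2), cosine(lnp_s1, lnp_s2)
```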
Furthermore, we use the discriminative similarities (examples with concrete figures are shown in the parentheses of Tables 2 and 3) to measure the reproducibility of our latent space pre-training. We additionally pre-train both types of latent spaces five times each, using the same hyperparameter setting mentioned in Section 3.2.4 with different random seeds. The results are shown in Figures 12 and 13. In both figures, it is consistent that non-paraphrase pairs tend to have high similarities in the non-paraphrase latent space while paraphrase pairs tend to maintain the same or decrease slightly. Out of the five runs of pre-training, compared to the previously pre-trained spaces whose discriminative similarities are shown in Tables 2 and 3, the 5th run of both types yields nearly identical PI, NLI, or STS task performance while the other runs decrease marginally. The results of the reproducibility test confirm that our pre-training is robust to random seeds.
3.4 Features
Our latent space-related features (1-12) and lexical overlap features (13-19) are summarized in Table 4. The features serve as input to multiple independent classifiers for different task datasets. Inspired by Ji and Eisenstein (Reference Ji and Eisenstein2013), 1-4 are concatenated as basic features. Although 9-12 are not directly related to our latent spaces, they are created along with 7-8 to augment the discriminative similarity (5-6) for every sentence pair. Besides the fine-grained n-gram overlap features (14-18) (Wan et al. Reference Wan, Dras, Dale and Paris2006), we also enrich the granularity level by 13 (sentence level) and 19 (character level) (Popović Reference Popović2015). As mentioned first in Section 2.1, both types of features are mutually complementary to each other, and we delve deeper into this by performing a concrete analysis in Section 4.
4. Experimental results
The experimental results presented in this section are to verify what we claimed in the previous sections. First, to test the effectiveness of two weighted reward components, the ablation study is performed. Then we apply the pre-trained latent spaces to multiple PI, NLI, and STS benchmarks to confirm the pre-trained task-specific architecture of PEFBAT is useful for different tasks. The experimental results presented in the last subsection explain the mutual complementation mentioned in Sections 2.1 and 3.4.
4.1 Ablation study for reward components
As described in Section 3.2.2, the three thresholds (sent-len and sent-len-diff assisting j-dist) are considered as a whole package to determine whether the two reward components should be weighted strongly or weakly, which is the way to represent whether a sentence pair exhibits (non)paraphrase-like characteristics. With their functionality, sentence pairs that have (non)paraphrase-like characteristics tend to obtain similar latent representations in the (non)paraphrase latent spaces, which is conducive to the generation of discriminative features.
To verify our claim, the results of ablation study are presented in this subsection. Note that two things listed below are invariable in the study.
-
(1) Whether for a single sentence or a sentence pair, the three word embedding components (entity, action, and modification) convey the principal semantics in the initial sentence representation, so they are deemed indispensable in our work and are not varied in the study. This is validated by the good experimental result produced by the latent spaces pre-trained without the two weighted reward components.
-
(2) The maximum number of reward components is two, that is, a total dimensionality of 600 (300 each), because when the number exceeds two, the latent spaces are not trainable—the error loss hardly drops. Even if the spaces were trainable, we would not increase the number, as too many reward components would depreciate the effectiveness of the word embedding components.
The models that participate in the ablation study are listed below.
-
baseline: our baseline is pure feature engineering: the lexical overlap features (13-19) in Table 4.
-
origin: the two latent spaces are pre-trained based on the implementation details described in Section 3.2.
-
no j-dist: the two latent spaces are pre-trained based on the implementation details described in Section 3.2 except that the reward component weighted by j-dist is removed during pre-training.
-
no sent-factor: the two latent spaces are pre-trained based on the implementation details described in Section 3.2 except that the reward component weighted by the combination of sent-len and sent-len-diff is removed during pre-training.
-
no reward: the two latent spaces are pre-trained based on the implementation details described in Section 3.2 except that the two weighted reward components are removed during pre-training.
-
extra factor: the total dimensionality of the two reward components is 600. So besides j-dist, sent-len, and sent-len-diff, we also include the thresholds of the lexical overlap features (13-19) in Table 4 to weight the two reward components. The 600 dimensions are evenly distributed among the thresholds. Accordingly, this model consists of the two latent spaces pre-trained based on the implementation details described in Section 3.2, except that the two reward components are weighted by multiple thresholds during pre-training.
As for evaluation, we perform PI on MRPC (Dolan et al. Reference Dolan, Quirk and Brockett2004) using the SVM with linear kernel as the classifier, and the metrics are accuracy and F1 score.Footnote h Except for the baseline, the other models produce the latent space-related features (1-12), which are combined with the lexical overlap features (13-19) in Table 4 as input to the classifier. The experimental results are shown in Table 5, from which we can draw the following three main points.
Firstly, the performance of the “extra factor” model, regardless of pre-trained type, is not satisfactory. The results are within our expectations, because using many characteristics-based thresholds to weight the two reward components is tantamount to pure feature engineering, as the effectiveness of the three word embedding components is lessened.
Secondly, the performance of the “no reward” model, although lower than that of “origin” in both pre-trained types, reflects the fact that the three weighted word-embedding components are fairly robust in conveying important semantics in the initial sentence-pair representations. As shown in the table, it is competitive with the other two models (“no j-dist” & “no sent-factor”) that contain a single weighted reward component in the initial sentence-pair representations.
Thirdly, the performance of “origin” is competitive with various task-specific DNNs. While the performance level is similar, our pre-trained latent spaces have an additional advantage: the fixed parameters can be used for other task datasets as well, which is parameter-efficient and demonstrated in the next subsection. In addition, two sub-points listed below need to be further explained.
-
(a) The pre-trained type “with digit-count” is the assumption we make in Section 3.2.2 that when there are too many digital numbers occurring in a sentence pair, the influence of semantics represented in the entity component should be lessened. Therefore, the threshold of digit-count is used to determine whether the entity component should be weighted strongly or weakly. Although the assumption is less effective for MRPC, it can indeed take effect in some other benchmarks, which is shown in the experimental results in Section 4.2.
-
(b) The functionality of the two weighted reward components is to make sentence pairs with (non)paraphrase-like characteristics obtain similar latent representations in the (non)paraphrase latent spaces. As shown in Tables 2 and 3, non-paraphrase sentence pairs tend to have higher cosine similarity ( $\approx$ 0.73) in the non-paraphrase latent space, while paraphrase sentence pairs tend to remain the same or decrease marginally ( $\approx$ 0.81). The discriminative similarity is further justified in this ablation study. As shown in Tables 6 and 7, the phenomenon is not reflected by the “no reward” model: both types of sentence pairs tend to have higher cosine similarities in the paraphrase latent space. This is the main reason why “origin” outperforms “no reward.” Later in Section 4.3, we delve deeper into the utility of discriminative similarity.
In addition, latent spaces with randomly initialized parameters and no pre-training achieve only on-par performance with the baseline, which justifies the usefulness of pre-training. We also probe replacements; for example, replacing j-dist with the Levenshtein distance (feature 13 in Table 4) to weight one reward component does not outperform “origin.”
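For reference, the classifier step of this ablation can be sketched with scikit-learn as below; random toy data stands in for the real feature matrices, the feature dimensionality simply mirrors the MLP input length reported in Section 4.2, and the value of C would normally be tuned on the development set.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

# Toy stand-ins for the real feature matrices (latent space-related features
# 1-12 concatenated with lexical overlap features 13-19 from Table 4).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 432)), rng.integers(0, 2, 200)
X_test, y_test = rng.normal(size=(50, 432)), rng.integers(0, 2, 50)

clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred), "F1:", f1_score(y_test, pred))
```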
4.2 Pre-trained latent spaces with manually adjustable thresholds
The experimental results shown in this subsection are intended to verify three of our claims: (1) the pre-trained latent spaces are not limited to MRPC (Dolan et al. 2004); (2) PEFBAT can handle NLI and STS tasks in addition to PI tasks; (3) PEFBAT is capable of power-efficient continual learning. We apply the pre-trained latent spaces to multiple benchmarks including six PI tasks, two NLI tasks, and two STS tasks. Besides, the task performance achieved by adapter-BERT (Houlsby et al. Reference Houlsby, Giurgiu, Jastrzebski, Morrone, de Laroussilhe, Gesmundo, Attariyan and Gelly2019b) is considered our upper bound, because without the consideration of power efficiency, it is so far a relatively high-end implementation of a parameter-efficient method. Before stepping into the details, we underline that all the tasks are tested with the fixed parameters of the pre-trained latent spaces without task-specific re-training. Our open-source code is available under this link.Footnote i
Metrics & adjusted thresholds: We experiment with six PI tasks, two NLI and STS tasks, respectively, and the results are summarized in Table 8. For the PI tasks including MRPC (Dolan et al. Reference Dolan, Quirk and Brockett2004), PAN (Madnani et al. Reference Madnani, Tetreault and Chodorow2012), QQPFootnote j (Iyer et al. Reference Iyer, Dandekar and Csernai2017), Twitter-URL (Lan et al. Reference Lan, Qiu, He and Xu2017), and PARADE (He et al. Reference He, Wang, Zhang, Huang and Caverlee2020), accuracy/F1 scores are reported; except for PAWS-wiki (Zhang et al. Reference Zhang, Baldridge and He2019), we report accuracy/AUC scores complying with the metric presented in the original paper. Accuracy scores are reported for two NLI tasks: SICK-E (Marelli et al. Reference Marelli, Bentivogli, Baroni, Bernardi, Menini and Zamparelli2014) and SciTail (Khot, Sabharwal, and Clark Reference Khot, Sabharwal and Clark2018). Pearson/Spearman correlations are reported for two STS tasks: SICK-R (Marelli et al. Reference Marelli, Bentivogli, Baroni, Bernardi, Menini and Zamparelli2014) and STS-B (Cer et al. Reference Cer, Diab, Agirre, Lopez-Gazpio and Specia2017).
The “default thresholds” are the thresholds shown in Table 1, which are determined by the characteristics of the training dataset of MRPC (Dolan et al. Reference Dolan, Quirk and Brockett2004). The “adjusted thresholds” indicate that we manually adjust the thresholds used to weight the corresponding components according to the characteristics of different training datasets. For example, the thresholds of sent-len-diff and j-dist are adjusted to 8 and 0.8, respectively, for the PARADE (He et al. Reference He, Wang, Zhang, Huang and Caverlee2020) task. In addition, the “adjusted thresholds” only work for task datasets with binary class labels, including five PI tasks and one NLI task, SciTail (Khot et al. Reference Khot, Sabharwal and Clark2018), whose sentence pairs are labeled with entails or neutral.
Classifier settings and adapter-BERT implementation: For small task datasets like MRPC containing fewer than 10k sentence pairs, we try both the SVM with linear kernel and the MLP, and report the one with better task performance. The remaining tasks, with relatively large data sizes, are tested with only the MLP. Accuracy is our measure for tuning classifiers. When tuning the SVM, we use the development dataset to determine the hyperparameter c by means of grid search and then combine it with the training dataset to train the classifier. As for the MLP, we use the widely adopted hyperparameters for fine-tuning BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019); the small difference is that our training epochs are 100 for all tasks and the batch size is in {8, 16, 32}. Thus, the development dataset is directly combined with the training dataset to train the classifier. Our MLP structure for PI tasks is “432 - 900 (ReLU) - 100 (ReLU) - 2 (softmax + cross entropy)” (0.479M parameters in total).
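The PI classifier can be sketched in Keras as below; the Adam learning rate of 2e-5 is an assumption standing in for the widely adopted BERT fine-tuning hyperparameters mentioned above.

```python
import tensorflow as tf

# 432 - 900 (ReLU) - 100 (ReLU) - 2 (softmax), trained with cross entropy.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(900, activation="relu", input_shape=(432,)),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # assumed BERT-style setting
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_features, train_labels, batch_size=32, epochs=100)
```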
For NLI and STS tasks, the hidden layers of MLP are the same, but output length and loss function might be different (see classifier-tuning examples in our open-source code9). As reflected in Yin et al. (Reference Yin, Schütze, Xiang and Zhou2016)’s work, seven linguistic features like the number of hypernyms in a sentence pair are particularly useful for NLI tasks. We append them to our PI features and thus the input length of MLP for NLI tasks is 439. However, the seven additional linguistic features improve STS tasks marginally and can bring side effect to PI tasks, and therefore are not utilized.
Our MLP structure is uniform and heuristics-based. Although better task performance could likely be achieved by tailoring task-specific settings (such as different layer sizes, dropout rates, or layer normalization) to each task, we use the simplest classifier setting to test the effectiveness of PEFBAT as a purely academic demonstration.
We use the bert-for-tf2Footnote k implementation for adapter-BERT. In light of the adapter’s performance on the GLUE benchmark (Wang et al. Reference Wang, Singh, Michael, Hill, Levy and Bowman2018), we choose an adapter size of 64 based on BERT-base ($\approx$ 3 M trainable parameters) and follow the hyperparameters recommended in the original paper. For each task, we perform fine-tuning with four different numbers of training epochs {3, 10, 20, 50} and report the one that achieves optimal performance. The exception is QQP (Iyer et al. Reference Iyer, Dandekar and Csernai2017), for which 3 epochs are sufficient given its large data size.
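A sketch of constructing the adapter-BERT layer with bert-for-tf2, following our reading of the library’s documentation; the checkpoint path is a placeholder, the 2-way classification head is ours, and the training loop over the epoch settings above is omitted:

```python
import tensorflow as tf
import bert
from bert.loader import load_stock_weights

model_dir = "uncased_L-12_H-768_A-12"                  # placeholder: BERT-base checkpoint dir
bert_params = bert.params_from_pretrained_ckpt(model_dir)
bert_params.adapter_size = 64                           # adapter size used in our experiments
l_bert = bert.BertModelLayer.from_params(bert_params, name="bert")

max_seq_len = 128
input_ids = tf.keras.layers.Input(shape=(max_seq_len,), dtype="int32")
seq_out = l_bert(input_ids)                             # [batch, seq_len, hidden]
cls_vec = tf.keras.layers.Lambda(lambda x: x[:, 0, :])(seq_out)
logits = tf.keras.layers.Dense(2, activation="softmax")(cls_vec)
model = tf.keras.Model(inputs=input_ids, outputs=logits)

# Freeze the original BERT weights so that only adapters (and layer norm) stay
# trainable, then load the pre-trained checkpoint weights.
l_bert.apply_adapter_freeze()
load_stock_weights(l_bert, model_dir + "/bert_model.ckpt")
```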
Discussion on performance: We discuss our task performance from three facets: the effect of the “with digit-count” assumption; default versus adjusted thresholds; and the consumption of computational and memory resources.
The with digit-count model works for task datasets that are sensitive to digits. In particular, PARADE (He et al. Reference He, Wang, Zhang, Huang and Caverlee2020) comprises computer science-related literature, so digits are common in its sentence pairs. As a result, the latent spaces pre-trained with the digit-count threshold in the dual weighting scheme are comparatively effective for this task, which validates the assumption that when many digits occur in a sentence pair, the influence of the semantics represented in the entity component should be lessened. On the other hand, the assumption is not always effective. For example, PAWS (Zhang et al. Reference Zhang, Baldridge and He2019) is created to measure models’ sensitivity to word order and syntactic structure, and as expected, the without digit-count model is comparatively effective for this task. In summary, the assumption is neither absolutely right nor absolutely wrong: from the perspective of continual learning, tasks from customers will inevitably vary, and some will resemble PARADE while others will not.
The adjusted thresholds can improve the corresponding tasks to some extent but not significantly, which reflects the fact that our latent spaces pre-trained with the default thresholds are already useful for various tasks, not only MRPC (Dolan et al. Reference Dolan, Quirk and Brockett2004). This is further validated by the task performance on PARADE and SICK-E: the default thresholds based on the with digit-count model outperform the upper bound. As mentioned earlier in this subsection, the adjustment applies only to binary-labeled task datasets, and we recommend it for practical use. To make the reuse of PEFBAT convenient, we have implemented programming interfaces, including a function for adjusting thresholds, which can be found in our open-source code.9
Our task performance falls into three groups: tasks that outperform the upper bound (Twitter-URL and SICK-E in terms of accuracy, PARADE in terms of both accuracy and F1 score); tasks with a moderate gap to the upper bound (MRPC, PAN, QQP, and SICK-R); and tasks with a noticeable gap to the upper bound (PAWS-wiki, STS-B, and SciTail). This level of performance is fairly competitive given that the consumption of computational and memory resources is substantially lighter. As shown in Table 8, the trainable parameters are only 16% of the upper bound in the case of the MLP, and the time saved for parameter update ranges from 69% (1 - 4/13) to 96% (1 - 2/50). An exception is the classifier tuned for MRPC, but that case uses the SVM classifier, which consumes CPU resources rather than the more costly GPU computation (Strubell et al. Reference Strubell, Ganesh and McCallum2019).

It is apparent that tuning adapter-BERT (Houlsby et al. Reference Houlsby, Giurgiu, Jastrzebski, Morrone, de Laroussilhe, Gesmundo, Attariyan and Gelly2019b) on the small tasks is more expensive than fine-tuning the original BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019), mainly because of the large number of training epochs required: 20 epochs for MRPC and PARADE, and 50 epochs for SICK-E, SICK-R, and STS-B. One might argue that tuning adapter-BERT with only 3 training epochs per task would consume less computing time while still outperforming PEFBAT. To check whether this is the case, we performed additional experiments, whose results are presented in Table 9. Except for QQP and SciTail, which already used 3 epochs, only three tasks (PAN, Twitter-URL, and PARADE) decrease marginally; the rest drop significantly. Moreover, although the corresponding computing time does decrease considerably, it remains above ours. With its light consumption of computational and memory resources per task, PEFBAT is capable of power-efficient continual learning, especially when tasks arriving from customers increase exponentially.
4.3 Analysis of discriminative similarity
Two experimental results are quite surprising to us. First, PARADE (He et al. Reference He, Wang, Zhang, Huang and Caverlee2020) is so far the most difficult PI task for the BERT family, according to the evidence provided in the original paper; however, PEFBAT (75.0/71.1) outperforms not only adapter-BERT (72.7/70.9) (Houlsby et al. Reference Houlsby, Giurgiu, Jastrzebski, Morrone, De Laroussilhe, Gesmundo, Attariyan, Gelly, Chaudhuri and Salakhutdinov2019a) but also BERT-large (73.6/70.9) (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019) and SciBERT (74.1/72.3) (Beltagy, Lo, and Cohan Reference Beltagy, Lo and Cohan2019) in terms of accuracy. Second, PAWS (Zhang et al. Reference Zhang, Baldridge and He2019) is created to measure models’ sensitivity to word order and syntactic structure, so our performance should arguably not be satisfactory; yet compared with the experimental results reported in the original paper, PEFBAT significantly outperforms the BOW baseline (55.8/41.1) and is better than most task-specific DNNs that are not pre-trained on the PAWS unlabeled corpus.
To explore the rationale behind this performance, we perform a statistical analysis to examine the effectiveness of our discriminative similarity, namely features 5-6 in Table 4. At the same level of Jaccard distance between sentence pairs, lexical overlap features can hardly differentiate paraphrase from non-paraphrase pairs; to what degree, then, does our discriminative similarity enhance this differentiation? The visualization illustrating the analysis is presented in Figures 14, 15, and 16.
The analysis is performed while evaluating the test dataset of PARADE (based on the with digit-count adjusted thresholds). At a Jaccard distance of 0.78 (Figure 14), the numbers of paraphrase and non-paraphrase sentence pairs are nearly identical (10 and 11, respectively), and PI accuracy is 67% (Figure 15), surpassing the 52% obtained if all twenty-one pairs were guessed as negative. Figure 16 then plots the cosine similarities of those sentence pairs in the paraphrase and non-paraphrase latent spaces, corresponding to features 5-6 in Table 4. In Figure 16, a hypothetical boundary line is drawn, with which the two features alone already reach 67% accuracy (9 true positive pairs and 5 true negative pairs), although the actual numbers of true positives and true negatives are 8 and 6, respectively, as shown in Figure 14. The combination of 8 and 6 is preferable to 9 and 5, given that there are 10 paraphrase and 11 non-paraphrase pairs at this Jaccard distance. We believe it is our design of features 7-12 (explained in Section 3.4) that augments features 5-6.
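The following sketch outlines this analysis procedure: sentence pairs are grouped by Jaccard distance, and within a group we collect each pair’s cosine similarities in the paraphrase and non-paraphrase latent spaces (features 5-6). The latent-space projections are treated as given inputs here, and all variable names are ours:

```python
import numpy as np

def jaccard_distance(s1, s2):
    # Token-set Jaccard distance between two sentences.
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return 1.0 - len(a & b) / len(a | b)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analyse_bin(pairs, labels, para_vecs, nonpara_vecs, j_dist=0.78, tol=0.01):
    """Collect (feature 5, feature 6, gold label) for pairs near a given Jaccard distance.

    para_vecs[i] / nonpara_vecs[i] are the two sentences' projections into the
    paraphrase and non-paraphrase latent spaces, respectively.
    """
    rows = []
    for (s1, s2), y, vp, vn in zip(pairs, labels, para_vecs, nonpara_vecs):
        if abs(jaccard_distance(s1, s2) - j_dist) <= tol:
            rows.append((cosine(*vp), cosine(*vn), y))
    return rows
```

A pair that falls clearly on the paraphrase side of a boundary in this two-dimensional space (feature 5 noticeably larger than feature 6) would be predicted as a paraphrase, which is the intuition behind the hypothetical boundary line in Figure 16.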
Returning to the points made in Sections 2.1 and 3.4: our latent space-related features are crucial when lexical overlap cannot distinguish paraphrase from non-paraphrase sentence pairs, while lexical overlap features form a strong basis for task datasets like PAN (Madnani et al. Reference Madnani, Tetreault and Chodorow2012) (see Figure 3). The two types of features are complementary.
5. Conclusion and future work
In this article, we proposed a new method (PEFBAT) for the PI task that improves the efficiency of parameter usage in feature-based transfer. Our motivation and research goal were described in Section 1, and the implementation details in Section 3. PEFBAT can also handle NLI and STS tasks, and its essence is a pre-trained task-specific architecture whose fixed parameters can be shared by multiple classifiers with small additional parameters. As a result, for each task, the only remaining computational cost involving parameter update comes from classifier-tuning: the features output from the architecture, combined with lexical overlap features, are fed into a single classifier for tuning. This design consumes only a small amount of computational and memory resources per task and thereby enables power-efficient continual learning. In Section 4, we experimented with multiple benchmarks, and the results showed that PEFBAT is competitive with adapter-BERT on some tasks while consuming only 16% of the trainable parameters and saving 69-96% of the time for parameter update. We also performed an ablation study and technical analysis to help further understand the mechanism of PEFBAT.
One direction for future work is to apply PEFBAT to low-resource natural languages with limited labeled and unlabeled data (Hedderich et al. Reference Hedderich, Lange, Adel, Strötgen and Klakow2021). Given a suitable research project, we will put this idea into practice, as the three main reasons listed below suggest that PEFBAT is implementable in low-resource settings.
(1) The benefit of transfer learning is now common knowledge: the parameters of a network trained on large corpora are transferred to related problems with little data. We, however, have managed to leverage the mechanism in the reverse direction. Our task-specific architecture (two latent spaces) is pre-trained on the training dataset of MRPC (Dolan et al. Reference Dolan, Quirk and Brockett2004), the smallest English paraphrase corpus. Nevertheless, the fixed parameters of the pre-trained latent spaces can be reused by other task datasets, which is beneficial in scenarios with limited labeled data.
(2) Mohiuddin and Joty (Reference Mohiuddin and Joty2020) show that fastText (Mikolov et al. Reference Mikolov, Grave, Bojanowski, Puhrsch and Joulin2018) is applicable to low-resource natural languages. The rationale is straightforward: fastText is not based on a deep learning architecture, which is beneficial when unlabeled data are limited.
(3) The POS-tags provided by the NLTK6 can be produced with HMMs (Hidden Markov Models),Footnote l a probabilistic approach to tag assignment (a minimal example is sketched below). Since this approach is likewise not based on a deep learning architecture, it does not strictly require large labeled datasets to train its parameters.
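As a minimal illustration of point (3), the sketch below trains NLTK’s HMM tagger on the Penn Treebank sample, which merely stands in for whatever limited labeled data a low-resource language might offer:

```python
import nltk
from nltk.corpus import treebank
from nltk.tag.hmm import HiddenMarkovModelTrainer

nltk.download("treebank", quiet=True)
train_sents = treebank.tagged_sents()[:3000]   # a few thousand labeled sentences suffice
hmm_tagger = HiddenMarkovModelTrainer().train_supervised(train_sents)
print(hmm_tagger.tag("The fixed parameters are shared by multiple classifiers".split()))
```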
Another direction for future work is to apply PEFBAT to a cloud environment (Houlsby et al. Reference Houlsby, Giurgiu, Jastrzebski, Morrone, de Laroussilhe, Gesmundo, Attariyan and Gelly2019b). Since PEFBAT is capable of power-efficient continual learning, whether it can deliver the same power efficiency in a cloud environment is, given sufficient research funds, an interesting topic for us. We plan to launch an AWS EC2 instance to deploy PEFBAT and then provide classifier interfaces for customers. Suppose the MLP classifiers that customers use have the same size as in this article (0.479M parameters per classifier); then each classifier accounts for only approximately 3.7 MB of memory, assuming 8 bytes per parameter. We are optimistic about this direction: in a simulation running 20 classifiers in parallel on our local device,3 parameter update is as fast as running them individually. Our goal is to guarantee that at least 1000 classifiers can run in parallel on a sufficiently powerful EC2 instance, as tasks arriving from customers can be simultaneous.