
Jointly learning sentence embeddings and syntax with unsupervised Tree-LSTMs

Published online by Cambridge University Press: 31 July 2019

Jean Maillard*
Affiliation:
Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
Stephen Clark
Affiliation:
DeepMind, London, UK
Dani Yogatama
Affiliation:
DeepMind, London, UK
*Corresponding author. Email: [email protected]

Abstract

We present two studies on neural network architectures that learn to represent sentences by composing their words according to automatically induced binary trees, without ever being shown a correct parse tree. We use Tree-LSTMs (tree-structured Long Short-Term Memory networks) as our composition function, applied along a tree structure found by a differentiable natural language chart parser. The models simultaneously optimise both the composition function and the parser, thus eliminating the need for externally provided parse trees, which are normally required for Tree-LSTMs. They can therefore be seen as tree-based recurrent neural networks that are unsupervised with respect to the parse trees. Because they are fully differentiable, the models are easily trained with an off-the-shelf gradient descent method and backpropagation.
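
To make the idea of a differentiable chart parser concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each chart cell is a softmax-weighted mixture of the compositions obtained at every split point, that candidates are scored by a dot product with a hypothetical vector `u`, and that a plain tanh layer (`compose`) stands in for the Tree-LSTM cell. All names and dimensions are illustrative.

```python
import math
import random

def compose(left, right, W):
    """Toy composition tanh(W [left; right]); a stand-in for a Tree-LSTM cell."""
    x = left + right  # concatenate the two child vectors
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cky(words, W, u):
    """Fill a CKY-style chart; each cell softly mixes all of its split points."""
    n = len(words)
    chart = {(i, i + 1): words[i] for i in range(n)}  # width-1 spans are the words
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            # One candidate per split point k: compose span (i, k) with span (k, j).
            candidates = [compose(chart[(i, k)], chart[(k, j)], W)
                          for k in range(i + 1, j)]
            # Assumed scoring rule: dot product of each candidate with u.
            weights = softmax([sum(a * b for a, b in zip(u, c)) for c in candidates])
            dim = len(candidates[0])
            chart[(i, j)] = [sum(w * c[d] for w, c in zip(weights, candidates))
                             for d in range(dim)]
    return chart

random.seed(0)
dim = 4
W = [[random.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
u = [random.uniform(-0.5, 0.5) for _ in range(dim)]
sentence = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]

chart = soft_cky(sentence, W, u)
sentence_vector = chart[(0, len(sentence))]  # representation of the whole sentence
```

Every operation in the sketch is a sum, product, tanh or softmax, so gradients could flow from `sentence_vector` back to both `W` (composition) and `u` (parsing decisions). This is the property that allows the composition function and the parser to be optimised together.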

In the first part of this paper, we introduce a model based on the CKY chart parser, and evaluate its downstream performance on a natural language inference task and a reverse dictionary task. Further, we show how its performance can be improved with an attention mechanism which fully exploits the parse chart, by attending over all possible subspans of the sentence. We find that our approach is competitive with similar models of comparable size and outperforms Tree-LSTMs that use trees produced by a parser.
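
Attending over all subspans can be illustrated with a small sketch. It assumes plain dot-product attention with a hypothetical `query` vector and a hand-written `toy_chart` that maps each span `(start, end)` to a vector. The scoring function used in the paper may differ.

```python
import math

def attend_over_spans(chart, query):
    """Dot-product attention over the vector of every span in a chart."""
    spans = sorted(chart)
    scores = [sum(q * v for q, v in zip(query, chart[s])) for s in spans]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = {s: e / total for s, e in zip(spans, exps)}
    dim = len(query)
    # The summary is the attention-weighted average of all span vectors.
    summary = [sum(weights[s] * chart[s][d] for s in spans) for d in range(dim)]
    return weights, summary

# Hypothetical chart for a three-word sentence: 3 + 2 + 1 = 6 spans.
toy_chart = {
    (0, 1): [1.0, 0.0], (1, 2): [0.0, 1.0], (2, 3): [-1.0, 0.0],
    (0, 2): [0.5, 0.5], (1, 3): [-0.5, 0.5],
    (0, 3): [0.0, 0.3],
}
weights, summary = attend_over_spans(toy_chart, query=[2.0, 0.0])
best_span = max(weights, key=weights.get)  # span receiving the most attention
```

Because the chart holds a vector for every contiguous span, not only for the spans of one chosen tree, the attention can draw on phrases that a single parse would leave out.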

Finally, we present an alternative architecture based on a shift-reduce parser. We perform an analysis of the trees induced by both our models, to investigate whether they are consistent with each other and across re-runs, and whether they resemble the trees produced by a standard parser.
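
As a rough illustration of how a shift-reduce parser yields a binary tree, and of one possible way to compare two induced trees, consider the sketch below. The action sequences are written by hand here, whereas the model would induce them. The overlap measure (the fraction of shared constituent spans) is an assumption for illustration, not necessarily the metric used in the paper.

```python
def shift_reduce(tokens, actions):
    """Build a binary tree (nested tuples) from SHIFT/REDUCE actions."""
    buffer = list(tokens)
    stack = []
    for action in actions:
        if action == "SHIFT":
            if not buffer:
                raise ValueError("SHIFT with an empty buffer")
            stack.append(buffer.pop(0))
        elif action == "REDUCE":
            if len(stack) < 2:
                raise ValueError("REDUCE needs two items on the stack")
            right = stack.pop()
            left = stack.pop()
            stack.append((left, right))  # a composition function would go here
        else:
            raise ValueError(f"unknown action: {action}")
    if buffer or len(stack) != 1:
        raise ValueError("action sequence does not yield a single tree")
    return stack[0]

def constituent_spans(tree, start=0):
    """Return (end, spans), where spans holds (start, end) for each internal node."""
    if not isinstance(tree, tuple):
        return start + 1, set()
    mid, left_spans = constituent_spans(tree[0], start)
    end, right_spans = constituent_spans(tree[1], mid)
    return end, left_spans | right_spans | {(start, end)}

tokens = ["the", "cat", "sat", "down"]
left_branching = shift_reduce(
    tokens, ["SHIFT", "SHIFT", "REDUCE", "SHIFT", "REDUCE", "SHIFT", "REDUCE"])
balanced = shift_reduce(
    tokens, ["SHIFT", "SHIFT", "REDUCE", "SHIFT", "SHIFT", "REDUCE", "REDUCE"])

_, spans_a = constituent_spans(left_branching)
_, spans_b = constituent_spans(balanced)
overlap = len(spans_a & spans_b) / len(spans_a)  # share of spans found in both trees
```

Two binary trees over the same sentence have the same number of internal nodes, so this overlap is symmetric. It could be averaged over sentences to compare two models, two re-runs of one model, or a model against a standard parser.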

Type
Article
Copyright
© Cambridge University Press 2019 

