Incorporating word embeddings in unsupervised morphological segmentation

Ahmet Üstün; Burcu Can

doi:10.1017/S1351324920000406


Published online by Cambridge University Press: 10 July 2020


Ahmet Üstün: The University of Groningen, Groningen, The Netherlands
Burcu Can*: Department of Computer Engineering, Hacettepe University, Ankara, Turkey
*Corresponding author. E-mail: [email protected]


Abstract

We investigate the use of semantic information for morphological segmentation, since words that are derived from each other remain semantically related. We use mathematical models such as maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation, incorporating semantic information obtained from dense word vector representations. Our approach does not require any annotated data, which makes it fully unsupervised; it requires only a small amount of raw data together with pretrained word embeddings for training. The results show that dense vector representations help morphological segmentation, especially for low-resource languages. We present results for Turkish, English, and German. Our semantic MLE model outperforms other unsupervised models for Turkish. The proposed models could also be used for any other low-resource language with concatenative morphology.
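The intuition that a derived word stays semantically close to its stem can be illustrated with a small sketch. This is not the authors' MLE or MAP model; it is a simplified illustration that accepts a stem/suffix split only when the cosine similarity between the word and the candidate stem is high enough. The toy vectors, the function names, and the 0.5 threshold are all assumptions made for the example.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only.
# A real setup would load pretrained word embeddings instead.
TOY_VECTORS = {
    "walking": [0.9, 0.1, 0.2],
    "walk": [0.8, 0.2, 0.1],
    "carpet": [0.1, 0.9, 0.2],
    "car": [0.9, 0.1, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_split(word, vectors=TOY_VECTORS, threshold=0.5):
    """Return the (stem, suffix) split whose stem is most similar to the word.

    Only stems present in the vocabulary are considered. Returns None when
    no candidate stem reaches the (assumed) similarity threshold.
    """
    if word not in vectors:
        return None
    best = None
    best_sim = threshold
    for i in range(1, len(word)):
        stem, suffix = word[:i], word[i:]
        if stem not in vectors:
            continue
        sim = cosine(vectors[word], vectors[stem])
        if sim >= best_sim:
            best, best_sim = (stem, suffix), sim
    return best


if __name__ == "__main__":
    for w in ("walking", "carpet"):
        print(w, best_split(w))
```

With these toy vectors, "walking" is split into "walk" + "ing" because the two vectors point in nearly the same direction, while "carpet" is left whole because "car" is a substring but not a semantic relative. The models described in the abstract combine such semantic evidence with probabilistic estimates rather than a fixed threshold.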

Keywords

Morphological segmentation; Unsupervised learning; Bayesian learning; Low-resource language

Type: Article
Information: Natural Language Engineering, Volume 27, Issue 5, September 2021, pp. 609–629

DOI: https://doi.org/10.1017/S1351324920000406
Copyright: © The Author(s), 2020. Published by Cambridge University Press


