
On morphological relatedness

Published online by Cambridge University Press: 10 February 2012

AHMED KHORSI
Affiliation:
College of Computer and Information Science, Al-Imam Mohammad Ibn Saud Islamic University, Riyadh, Kingdom of Saudi Arabia

Abstract

In this paper, we discuss the results of a new unsupervised and computationally lightweight measure that scores how closely two words are morphologically related. The measure is meant as an alternative to stemming, radical (root) extraction, and morphological analysis in a wide range of applications, especially those related to information extraction. Compared to light stemming, which seems to be the most convenient approach for systems with efficiency concerns, our measure does not unconditionally discard a prefix or a suffix, as light stemming does. Instead, it takes into account all letters of the word, but with different weights, so that no significant letter is missed. Compared to heavy stemming, morphological analysis, or radical extraction, which rely on dictionaries and compatibility databases, our measure does not rely on any language-specific morphological knowledge. This makes our approach unsupervised, theoretically language independent, and computationally much lighter. Our tests targeted Arabic, a Semitic language recognized to have a complex morphology due to its highly inflectional lexicon.

Type
Articles
Copyright
Copyright © Cambridge University Press 2012 

