
The automatic identification of lexical variation between language varieties

Published online by Cambridge University Press:  11 October 2010

YVES PEIRSMAN
Affiliation:
Research Foundation – Flanders (FWO), Egmontstraat 5, 1000 Brussels, Belgium; Quantitative Lexicology and Variational Linguistics (QLVL), University of Leuven, Blijde-Inkomststraat 21, P.O. Box 3308, 3000 Leuven, Belgium
DIRK GEERAERTS
Affiliation:
Quantitative Lexicology and Variational Linguistics (QLVL), University of Leuven, Blijde-Inkomststraat 21, P.O. Box 3308, 3000 Leuven, Belgium
DIRK SPEELMAN
Affiliation:
Quantitative Lexicology and Variational Linguistics (QLVL), University of Leuven, Blijde-Inkomststraat 21, P.O. Box 3308, 3000 Leuven, Belgium

Abstract

Languages are not uniform. Speakers of different language varieties use certain words differently – more or less frequently, or with different meanings. We argue that distributional semantics is the ideal framework for the investigation of such lexical variation. We address two research questions and present our analysis of the lexical variation between Belgian Dutch and Netherlandic Dutch. The first question involves a classic application of distributional models: the automatic retrieval of synonyms. We use corpora of two different language varieties to identify the Netherlandic Dutch synonyms for a set of typically Belgian words. Second, we address the problem of automatically identifying words that are typical of a given lect, either because of their high frequency or because of their divergent meaning. Overall, we show that distributional models are able to identify more lectal markers than traditional keyword methods. Distributional models also have a bias towards a different type of variation. In summary, our results demonstrate how distributional semantics can help research in variational linguistics, with possible future applications in lexicography or terminology extraction.
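As a rough illustration of the first research question, the sketch below retrieves a cross-variety synonym candidate by comparing context vectors. It is a minimal sketch under stated assumptions, not the model used in the paper: the toy sentences are invented, the function names (`context_vectors`, `cosine`, `nearest`) are hypothetical, raw co-occurrence counts stand in for whatever weighting a real model would apply, and both varieties are assumed to share their context vocabulary.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def nearest(word, source_vectors, target_vectors):
    """Return the target-variety word whose contexts best match `word`'s contexts."""
    query = source_vectors[word]
    candidates = [w for w in target_vectors if w != word]
    return max(candidates, key=lambda w: cosine(query, target_vectors[w]))


# Invented toy corpora (assumption): one per language variety.
belgian = [
    "de camion rijdt op de snelweg",
    "de chauffeur van de camion laadt goederen",
]
netherlandic = [
    "de vrachtwagen rijdt op de snelweg",
    "de chauffeur van de vrachtwagen laadt goederen",
    "de fiets staat in de schuur",
]

belgian_vectors = context_vectors(belgian)
netherlandic_vectors = context_vectors(netherlandic)
candidate = nearest("camion", belgian_vectors, netherlandic_vectors)
print(candidate)
```

The vectors for the two varieties are built from separate corpora but compared in the same space, which only works here because the surrounding context words are the same in both toy corpora.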

Type: Papers
Copyright © Cambridge University Press 2010

