The automatic identification of lexical variation between language varieties

YVES PEIRSMAN; DIRK GEERAERTS; DIRK SPEELMAN

doi:10.1017/S1351324910000161

Published online by Cambridge University Press: 11 October 2010

YVES PEIRSMAN: Affiliation:
Research Foundation – Flanders (FWO), Egmontstraat 5, 1000 Brussels, Belgium; Quantitative Lexicology and Variational Linguistics (QLVL), University of Leuven, Blijde-Inkomststraat 21, P.O. Box 3308, 3000 Leuven, Belgium
DIRK GEERAERTS: Affiliation:
Quantitative Lexicology and Variational Linguistics (QLVL), University of Leuven, Blijde-Inkomststraat 21, P.O. Box 3308, 3000 Leuven, Belgium
DIRK SPEELMAN: Affiliation:
Quantitative Lexicology and Variational Linguistics (QLVL), University of Leuven, Blijde-Inkomststraat 21, P.O. Box 3308, 3000 Leuven, Belgium

Abstract

Languages are not uniform. Speakers of different language varieties use certain words differently – more or less frequently, or with different meanings. We argue that distributional semantics is the ideal framework for the investigation of such lexical variation. We address two research questions and present our analysis of the lexical variation between Belgian Dutch and Netherlandic Dutch. The first question involves a classic application of distributional models: the automatic retrieval of synonyms. We use corpora of two different language varieties to identify the Netherlandic Dutch synonyms for a set of typically Belgian words. Second, we address the problem of automatically identifying words that are typical of a given lect, either because of their high frequency or because of their divergent meaning. Overall, we show that distributional models are able to identify more lectal markers than traditional keyword methods. Distributional models also have a bias towards a different type of variation. In summary, our results demonstrate how distributional semantics can help research in variational linguistics, with possible future applications in lexicography or terminology extraction.
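The first research question, retrieving a Netherlandic Dutch synonym for a typically Belgian word, can be illustrated with a minimal sketch. It is not the authors' implementation: the two toy corpora, the window size and the function names (`context_vectors`, `cosine`, `nearest_in_other_variety`) are assumptions made for illustration, and the raw co-occurrence counts stand in for whatever weighting a full model would apply. The idea shown is only that a word from one variety is compared, through the context words the two varieties share, with every word from the other variety.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(target, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def nearest_in_other_variety(word, source_vecs, target_vecs):
    """Return the word of the target variety whose contexts best match `word`."""
    query = source_vecs[word]
    scored = [(cosine(query, vec), candidate) for candidate, vec in target_vecs.items()]
    return max(scored)[1]


# Hypothetical toy corpora (invented sentences, not data from the article).
belgian = [
    "de camion rijdt op de snelweg",
    "de chauffeur laadt de camion",
]
netherlandic = [
    "de vrachtwagen rijdt op de snelweg",
    "de chauffeur laadt de vrachtwagen",
    "de fiets staat in de schuur",
]

belgian_vecs = context_vectors(belgian)
netherlandic_vecs = context_vectors(netherlandic)

# Look up the Netherlandic word that occurs in the contexts of Belgian "camion".
match = nearest_in_other_variety("camion", belgian_vecs, netherlandic_vecs)
print(match)
```

In this toy setting the comparison works because both varieties share almost all of their context vocabulary, which is the property that makes two varieties of one language easier to align than two unrelated languages. On real corpora, frequent function words such as "de" would dominate raw counts, so some form of association weighting and a much larger context vocabulary would be needed.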

Type: Papers
Information: Natural Language Engineering, Volume 16, Special Issue 4: Distributional Lexical Semantics, October 2010, pp. 469–491

DOI: https://doi.org/10.1017/S1351324910000161
Copyright: © Cambridge University Press 2010
