
Lörres, Möppes, and the Swiss. (Re)Discovering regional patterns in anonymous social media data

Published online by Cambridge University Press:  12 December 2019

Christoph Purschke*
Affiliation:
University of Luxembourg, Esch-sur-Alzette, Luxembourg
Dirk Hovy
Affiliation:
Bocconi University, Milan, Italy
*Author for correspondence: Christoph Purschke

Abstract

We study regional similarities and differences in language use on an anonymous mobile chat application in the German-speaking area. We use a neural network on 2.3 million online conversations to automatically learn representations of words and cities. These linguistic-use-based representations capture regional distinctions in a high-dimensional vector space that can be clustered and visualized to discover patterns in the data. We find that the resulting regional patterns are closely linked to the traditional division of German dialects, even though most of the conversations are written in standard German. The resulting maps correspond to traditional dialect divisions and language-external spatial structures, with a few notable exceptions that can be explained through external factors.

Our method also facilitates two qualitative analyses: discovering geographically pertinent words at various regional levels, and creating style profiles for specific regional groups based on various linguistic resources. The results of our study strongly suggest the existence of region-specific patterns of language use (“digital regiolects”) that represent distinctive strategies of linguistic stylization in relation to linguistic resources and topics. As a methodological contribution, we show how linguistic theory can drive the application and direction of neural network-based representation learning, and how the judicious application of such methods provides the basis for qualitative analysis of large-scale data collections.
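
As an illustration of the clustering step mentioned in the abstract, the following is a minimal sketch of how city representations in a vector space might be grouped into regions. It is not the authors' implementation: the abstract does not name the clustering algorithm, so average-linkage agglomerative clustering over cosine distance is assumed here, and the three-dimensional city vectors, the city selection, and the function names are hypothetical stand-ins for the learned high-dimensional representations.

```python
import math

# Hypothetical toy vectors; learned city representations would be
# high-dimensional and come from a trained model.
CITY_VECTORS = {
    "Hamburg": [0.9, 0.1, 0.0],
    "Bremen": [0.8, 0.2, 0.1],
    "Munich": [0.1, 0.9, 0.1],
    "Vienna": [0.2, 0.8, 0.2],
    "Zurich": [0.0, 0.2, 0.9],
    "Bern": [0.1, 0.1, 0.8],
}


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_clusters(vectors, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Closeness is the average pairwise cosine distance between the
    members of two clusters (average linkage, an assumed choice).
    """
    clusters = [[name] for name in sorted(vectors)]

    def linkage(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(cosine_distance(vectors[a], vectors[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        candidates = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            candidates,
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    return [sorted(c) for c in clusters]


clusters = agglomerative_clusters(CITY_VECTORS, n_clusters=3)
for members in clusters:
    print(", ".join(members))
```

With real data, the number of clusters would be a modelling decision rather than a fixed value, and the resulting groups could then be compared against traditional dialect divisions.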

Type
Articles
Copyright
© Cambridge University Press 2019 

