Book contents
- Frontmatter
- Contents
- From the Editors
- Notes on Contributors
- 1 Introduction: Language Variation Studies and Computational Humanities
- 2 Panel Discussion on Computing and the Humanities
- 3 Making Sense of Strange Sounds: (Mutual) Intelligibility of Related Language Varieties. A Review
- 4 Phonetic and Lexical Predictors of Intelligibility
- 5 Linguistic Determinants of the Intelligibility of Swedish Words among Danes
- 6 Mutual Intelligibility of Standard and Regional Dutch Language Varieties
- 7 The Dutch-German Border: Relating Linguistic, Geographic and Social Distances
- 8 The Space of Tuscan Dialectal Variation: A Correlation Study
- 9 Recognising Groups among Dialects
- 10 Comparison of Component Models in Analysing the Distribution of Dialectal Features
- 11 Factor Analysis of Vowel Pronunciation in Swedish Dialects
- 12 Representing Tone in Levenshtein Distance
- 13 The Role of Concept Characteristics in Lexical Dialectometry
- 14 What Role does Dialect Knowledge Play in the Perception of Linguistic Distances?
- 15 Quantifying Dialect Similarity by Comparison of the Lexical Distribution of Phonemes
- 16 Corpus-based Dialectometry: Aggregate Morphosyntactic Variability in British English Dialects
9 - Recognising Groups among Dialects
Published online by Cambridge University Press: 12 September 2012
Summary
Abstract. In this paper we apply various clustering algorithms to dialect pronunciation data. We also propose several evaluation techniques that should be used to deal with the instability of clustering techniques. The results show that three hierarchical clustering algorithms are not suitable for the data we are working with. The remaining algorithms successfully detected a two-way split of the data into Eastern and Western dialects. At the aggregate level used in this research, no further division of sites can be asserted with high confidence.
INTRODUCTION
Dialectometry is a multidisciplinary field that uses various quantitative methods in the analysis of dialect data. Very often these methods include classification algorithms, such as hierarchical clustering algorithms, used to detect groups within a given dialect area. Although clustering algorithms are known for their instability (Jain and Dubes, 1988), they are often applied without evaluation (Goebl, 2007; Nerbonne and Siedle, 2005) or with only partial evaluation (Moisl and Jones, 2005). Very small differences in the input data can produce substantially different groupings of dialects (Nerbonne et al., 2008). Without proper evaluation, it is very hard to determine whether the results of a clustering technique are an artifact of the algorithm or reflect real groups in the data.
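The instability described above can be probed with a simple perturbation check. The following is a minimal sketch, not the procedure used in this chapter: it assumes complete-linkage agglomerative clustering, uniform noise added to each distance, and a hypothetical toy distance matrix for six sites. The names `complete_linkage`, `add_noise`, `split_stability` and `TOY_DISTANCES` are illustrative only.

```python
import random


def complete_linkage(dist, k):
    """Merge the two closest clusters (complete linkage) until k remain."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]
    return frozenset(frozenset(c) for c in clusters)


def add_noise(dist, scale, rng):
    """Return a symmetric copy of dist with uniform noise in [-scale, scale]."""
    n = len(dist)
    noisy = [row[:] for row in dist]
    for i in range(n):
        for j in range(i + 1, n):
            value = max(0.0, dist[i][j] + rng.uniform(-scale, scale))
            noisy[i][j] = noisy[j][i] = value
    return noisy


def split_stability(dist, k=2, scale=0.05, runs=100, seed=0):
    """Share of noisy runs that reproduce the k-way split of the clean data."""
    rng = random.Random(seed)
    reference = complete_linkage(dist, k)
    same = sum(
        complete_linkage(add_noise(dist, scale, rng), k) == reference
        for _ in range(runs)
    )
    return reference, same / runs


# Hypothetical distances: sites 0-2 and sites 3-5 form two tight groups.
TOY_DISTANCES = [
    [0.0, 0.1, 0.2, 0.8, 0.9, 0.8],
    [0.1, 0.0, 0.1, 0.9, 0.8, 0.9],
    [0.2, 0.1, 0.0, 0.8, 0.9, 0.8],
    [0.8, 0.9, 0.8, 0.0, 0.1, 0.2],
    [0.9, 0.8, 0.9, 0.1, 0.0, 0.1],
    [0.8, 0.9, 0.8, 0.2, 0.1, 0.0],
]

reference_split, share = split_stability(TOY_DISTANCES)
print(sorted(sorted(group) for group in reference_split), share)
```

On this well-separated toy matrix the two-way split is recovered in every noisy run, because the noise is smaller than the gap between within-group and between-group distances. On real dialect distances a share well below 1 would suggest that a proposed grouping depends on small details of the input.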
The aim of this paper is to evaluate algorithms used to detect groups among dialect varieties measured at the aggregate level. The data used in this research are dialect pronunciation data, consisting of various pronunciations of 156 words collected all over Bulgaria.
- Type: Chapter
- Information: Computing and Language Variation: International Journal of Humanities and Arts Computing Volume 2, pp. 153-172. Publisher: Edinburgh University Press. Print publication year: 2009