Book contents
- Frontmatter
- Contents
- Preface
- 1 Introduction: goals and methods of the corpus-based approach
- Part I Investigating the use of language features
- Part II Investigating the characteristics of varieties
- Part III Summing up and looking ahead
- Part IV Methodology boxes
- 1 Issues in corpus design
- 2 Issues in diachronic corpus design
- 3 Concordancing packages versus programming for corpus analysis
- 4 Characteristics of tagged corpora
- 5 The process of tagging
- 6 Norming frequency counts
- 7 Statistical measures of lexical associations
- 8 The unit of analysis in corpus-based studies
- 9 Significance tests and the reporting of statistics
- 10 Factor loadings and dimension scores
- Appendix: commercially available corpora and analytical tools
- References
- Index
7 - Statistical measures of lexical associations
Published online by Cambridge University Press: 05 June 2012
Summary
The simplest way to identify collocate pairs is by their relative frequency – that is, by how commonly one pair, such as “large number,” occurs relative to another pair, such as “large man.” Such frequency information can give a sense of the most common collocational associations.
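As a rough illustration of the frequency-based approach, the sketch below counts which words immediately follow a given word and reports each pair's share of the total. The token list is invented for the example, and the function name `pair_counts` is hypothetical rather than taken from any concordancing package.

```python
from collections import Counter


def pair_counts(tokens, node):
    """Count the words that immediately follow `node` in a list of tokens."""
    return Counter(
        right for left, right in zip(tokens, tokens[1:]) if left == node
    )


# Toy token list, invented purely for illustration (not corpus data).
tokens = (
    "a large number of people saw a large man and "
    "a large number of cars"
).split()

counts = pair_counts(tokens, "large")
total = sum(counts.values())

# Report each collocate pair with its raw count and its share of all
# pairs that begin with "large".
for collocate, n in counts.most_common():
    print(f"large {collocate}: {n} ({n / total:.0%} of 'large' pairs)")
```

With a real corpus the same counts would normally be normed or compared across registers; here they only show how relative frequency ranks one pair against another.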
However, frequency information alone may present a biased measure of the strength of associations between words. More common words are more likely to occur in a collocate pair simply by chance. Therefore, an alternative way to judge the strength of associations between words is to use measures that account for the likelihood of words occurring together by chance – i.e., statistical measures.
Statistical measures can be used to analyze both the associations between collocate pairs and the differences between the collocations of particular words. Below we briefly review a common statistical test for each of these purposes. We explain the principles behind the tests, but for details about the statistical formulas, you should consult the articles listed under “Further reading.”
Mutual information score
The mutual information score or mutual information index gives a measure of the strength of association between two words. It focuses on the likelihood of two words appearing together within a particular span of words (the span is specified for the analysis, e.g., adjacent words, a window of three words, etc.).
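The sketch below shows one common way such a score is computed: the base-2 logarithm of the observed co-occurrence count divided by the count expected by chance. This particular formulation, the scaling of the expected count by the span, and the names `mutual_information` and `span` are assumptions made for illustration; the exact formula used in a given study should be taken from the articles under "Further reading."

```python
import math
from collections import Counter


def mutual_information(tokens, word1, word2, span=1):
    """Return log2(observed / expected) for word2 within `span` words after word1.

    Assumed formulation: expected = f(word1) * f(word2) * span / N,
    i.e. the co-occurrence count predicted if the two words were
    distributed independently of each other.
    """
    n = len(tokens)
    freq = Counter(tokens)

    # Observed: occurrences of word1 that have word2 inside the window
    # of `span` words to the right.
    observed = sum(
        1
        for i, tok in enumerate(tokens)
        if tok == word1 and word2 in tokens[i + 1 : i + 1 + span]
    )
    if observed == 0:
        return float("-inf")  # the pair never co-occurs in this sample

    expected = freq[word1] * freq[word2] * span / n
    return math.log2(observed / expected)


# Toy token list, invented purely for illustration (not corpus data).
tokens = (
    "a large number of people saw a large man and "
    "a large number of cars"
).split()

print(mutual_information(tokens, "large", "number"))  # adjacent words
print(mutual_information(tokens, "large", "of", span=3))  # three-word window
```

A score above zero means the pair occurs more often than chance would predict, and a score below zero means it occurs less often. Because the expected count depends on how frequent each word is, a pair of very common words needs many co-occurrences to reach a high score.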
- Type: Chapter
- Corpus Linguistics: Investigating Language Structure and Use, pp. 265-268
- Publisher: Cambridge University Press
- Print publication year: 1998