A Cautionary Note on using Internal Cross Validation to Select the Number of Clusters

Published online by Cambridge University Press:  01 January 2025

Abba M. Krieger
Affiliation:
Department of Statistics, University of Pennsylvania
Paul E. Green*
Affiliation:
Department of Marketing, University of Pennsylvania
* Requests for reprints should be sent to Paul Green, Marketing Department, The Wharton School, University of Pennsylvania, 1400 Steinberg Hall-Dietrich Hall, Philadelphia PA 19104-6371.

Abstract

A highly popular method for examining the stability of a data clustering is to split the data into two parts, cluster the observations in Part A, assign the objects in Part B to their nearest centroid in Part A, and then independently cluster the Part B objects. One then examines how close the two partitions are (say, by the Rand measure). Another proposal is to split the data into k parts, and see how their centroids cluster. By means of synthetic data analyses, we demonstrate that these approaches fail to identify the appropriate number of clusters, particularly as sample size becomes large and the variables exhibit higher correlations.
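
The split-sample procedure described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not name a clustering algorithm, so a plain Lloyd-style k-means with random starting points is assumed here, the unadjusted Rand index is used as the agreement measure, and the function names (`kmeans`, `rand_index`, `split_half_rand`) are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (ties go to the lowest index)."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, iters=50, seed=0):
    """Assumed clustering step: basic k-means started from k sampled points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


def rand_index(labels_a, labels_b):
    """Share of object pairs on which two partitions agree (same/different cluster)."""
    n = len(labels_a)
    agree = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            agree += same_a == same_b
    return agree / (n * (n - 1) / 2)


def split_half_rand(points, k, seed=0):
    """Split into Parts A and B, then compare two partitions of Part B."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    part_a, part_b = shuffled[:half], shuffled[half:]
    centroids_a, _ = kmeans(part_a, k, seed=seed)          # cluster Part A
    assigned = [nearest(p, centroids_a) for p in part_b]   # B -> nearest A centroid
    _, own = kmeans(part_b, k, seed=seed)                  # cluster Part B on its own
    return rand_index(assigned, own)


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic data: two Gaussian blobs in two dimensions (illustrative only).
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
    data += [(rng.gauss(6, 1), rng.gauss(6, 1)) for _ in range(60)]
    for k in (2, 3, 4, 5):
        print(k, round(split_half_rand(data, k), 3))
```

Under this approach one would compute the agreement for each candidate number of clusters and favor the value with the highest score. The abstract's caution is that this choice can miss the appropriate number of clusters, particularly with large samples and more highly correlated variables.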

Type
Original Paper
Copyright
Copyright © 1999 The Psychometric Society

Footnotes

The authors express their thanks to the Sol C. Snider Entrepreneurial Center, Wharton School, for support of this project.
