Hostname: page-component-745bb68f8f-g4j75 Total loading time: 0 Render date: 2025-01-08T10:23:45.310Z Has data issue: false hasContentIssue false

Validating Clusters with the Lower Bound for Sum-of-Squares Error

Published online by Cambridge University Press:  01 January 2025

Douglas Steinley*
Affiliation:
University of Missouri-Columbia
*
Requests for reprints should be sent to Douglas Steinley, Department of Psychological Sciences, University of Missouri-Columbia, 210 McAlester Hall, Columbia, MO 65211, USA. E-mail: [email protected].

Abstract

Given that a minor condition holds (e.g., the number of variables is greater than the number of clusters), a nontrivial lower bound for the sum-of-squares error criterion in K-means clustering is derived. By calculating the lower bound for several different situations, a method is developed to determine the adequacy of cluster solution based on the observed sum-of-squares error as compared to the minimum sum-of-squares error.

Type
Original Paper
Copyright
Copyright © 2006 The Psychometric Society

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Footnotes

The author was partially supported by the Office of Naval Research Grant #N00014-06-0106.

References

Anderberg, M.R. (1973). Cluster analysis for applications, New York: Academic Press.Google Scholar
Brusco, M.J., Cradit, J.D. (2001). A variable-selection heuristic for K-means clustering. Psychometrika, 66, 249270.CrossRefGoogle Scholar
Cattell, R.B., Coulter, M.A. (1966). Principles of behavioral taxonomy and the mathematical basis of the taxonome computer program. British Journal of Mathematical and Statistical Psychology, 19, 237269.CrossRefGoogle ScholarPubMed
Cormack, R.M. (1971). A review of classification (with discussion). Journal of the Royal Statistical Society, Series A, 134, 321367.CrossRefGoogle Scholar
Cox, D.R. (1957). Note on grouping. Journal of the American Statistical Association, 52, 543547.CrossRefGoogle Scholar
Duda, R.O., Hart, P.E., Stork, D.G. (2001). Pattern recognition (2nd ed.),, New York: Wiley.Google Scholar
Engelman, L., Hartigan, J.A. (1969). Percentage points of a test of clusters. Journal of the American Statistical Association, 64, 16471648.CrossRefGoogle Scholar
Fan, K. (1949). On a theorem of Weyl concerning eigenvalues of linear transformations. Proceedings of the National Academy of Sciences of the United States of America, 35, 652–655.CrossRefGoogle Scholar
Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annual Eugenics, 7, 179188.CrossRefGoogle Scholar
Fisher, W.D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association, 53, 789798.CrossRefGoogle Scholar
Gentle, J.E. (2002). Elements of computational statistics, New York: Springer-Verlag.Google Scholar
Gersho, A., Gray, R.M. (1992). Vector quantization and signal compression, Boston, MA: Kluwer Academic.CrossRefGoogle Scholar
Golub, G.H., Van Loan, C.F. (1996). Matrix computations (3rd ed.),, Baltimore: The Johns Hopkins University Press.Google Scholar
Gordon, A.D. (1999). Classification (2nd ed.),, New York: Chapman & Hall/CRC.CrossRefGoogle Scholar
Gordon, A.D., Henderson, J.J. (1977). An algorithm for Euclidean sum of squares classification. Biometrics, 33, 355362.CrossRefGoogle Scholar
Hartigan, J.A. (1975). Clustering algorithms, New York: Wiley.Google Scholar
Hubert, L., Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193218.CrossRefGoogle Scholar
Hubert, L.J., Arabie, P., Meulman, J. (2001). Combinatorial data analysis: Optimization by dynamic programming, Philadelphia: SIAM.CrossRefGoogle Scholar
Johnson, R.A., Wichern, D.W. (2002). Applied multivariate statistical analysis (5th ed.),, Upper Saddle River, NJ: Prentice Hall.Google Scholar
Kaufman, L., Rousseeuw, P.J. (1987). Clustering by means of medoids. In Dodge, Y. (Eds.), Statistical data analysis based on the L 1-norm and related methods (pp. 405416). Amsterdam: Elsevier Science.Google Scholar
Kaufman, L., Rousseeuw, P.J. (1990). Finding groups in data: An introduction to cluster analysis, New York: Wiley.CrossRefGoogle Scholar
Lattin, J., Carroll, J.D., Green, P.E. (2003). Analyzing multivariate data, Pacific Grove, CA: Brooks/Cole.Google Scholar
MacQueen, J. (1967). Some methods of classification and analysis of multivariate observations. In L.M. Le Cam, & J. Neyman (Eds.), Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, pp.~281–297). Berkeley, CA: University of California Press.Google Scholar
Milligan, G.W. (1985). An algorithm for generating artificial test clusters. Psychometrika, 50, 123127.CrossRefGoogle Scholar
Milligan, G.W., Cooper, M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159179.CrossRefGoogle Scholar
Milligan, G.W., Cooper, M.C. (1988). A study of standardization of variables in cluster analysis. Journal of Classification, 5, 181204.CrossRefGoogle Scholar
Sebestyen, G.S. (1962). Decision making process in pattern recognition, New York: Macmillan.Google Scholar
Späth, H. (1980). Cluster analysis algorithms for data reduction and classification of objects, New York: Wiley.Google Scholar
Steinley, D. (2003). K-means clustering: What you don't know may hurt you. Psychological Methods, 8, 294304.CrossRefGoogle ScholarPubMed
Steinley, D. (2004a). Standardizing variables in K-means clustering. In Banks, D., House, L., McMorris, F.R., Arabie, P., Gaul, W. (Eds.), Classification, clustering, and data mining applications (pp. 5360). New York: Springer-Verlag.CrossRefGoogle Scholar
Steinley, D. (2004b). Properties of the Hubert–Arabie adjusted Rand index. Psychological Methods, 9, 386396.CrossRefGoogle ScholarPubMed
Steinley, D. (in press). K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology.Google Scholar
Steinley, D., Henson, R. (2005). OCLUS: An analytic method for generating clusters with known overlap. Journal of Classification, 22, 221250.CrossRefGoogle Scholar
Thorndike, R.L. (1953). Who belongs in the family?. Psychometrika, 18, 267276.CrossRefGoogle Scholar
Tibshirani, R., Walther, G., Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B, 63, 411423.CrossRefGoogle Scholar
Timm, N.H. (2002). Applied multivariate analysis, New York: Springer-Verlag.Google Scholar
van Os, B.J. (2000). Dynamic programming for partitioning in multivariate data analysis, Leiden: University Press.Google Scholar
Waller, N.G., Underhill, J.M., Kaiser, H.A. (1999). A method for generating simulated plasmodes and artificial test clusters with user-defined shape, size, and orientation. Multivariate Behavioral Research, 34, 123142.CrossRefGoogle ScholarPubMed
Weisstein, E.W. (2003). CRC concise encyclopedia of mathematics, Boca Raton, FL: Champman & Hall.Google Scholar
Zha, H., Ding, C., Gu, M., He, X., Simon, H.D. (2001). Spectral relaxation for K-means clustering. Nueral Information Processsing Systems, 14, 10571064.Google Scholar