Book contents
- Frontmatter
- Contents
- Contributors
- Preface
- 1 Scaling Up Machine Learning: Introduction
- Part One Frameworks for Scaling Up Machine Learning
- Part Two Supervised and Unsupervised Learning Algorithms
- 6 PSVM: Parallel Support Vector Machines with Incomplete Cholesky Factorization
- 7 Massive SVM Parallelization Using Hardware Accelerators
- 8 Large-Scale Learning to Rank Using Boosted Decision Trees
- 9 The Transform Regression Algorithm
- 10 Parallel Belief Propagation in Factor Graphs
- 11 Distributed Gibbs Sampling for Latent Variable Models
- 12 Large-Scale Spectral Clustering with MapReduce and MPI
- 13 Parallelizing Information-Theoretic Clustering Methods
- Part Three Alternative Learning Settings
- Part Four Applications
- Subject Index
- References
12 - Large-Scale Spectral Clustering with MapReduce and MPI
from Part Two - Supervised and Unsupervised Learning Algorithms
Published online by Cambridge University Press: 05 February 2012
Summary
Spectral clustering is a technique for finding group structure in data. It uses the spectrum of the data similarity matrix to perform dimensionality reduction, so that clustering can be carried out in a lower-dimensional space. Spectral clustering algorithms have been shown to find clusters more effectively than traditional algorithms such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computation time when the dataset is large. In this work, we parallelize both the memory use and the computation using MapReduce and MPI so that clustering can be performed on large datasets. Through an empirical study on a document set of 534,135 instances and a photo set of 2,121,863 images, we show that our parallel algorithm can effectively handle large problems.
Clustering is one of the most important tasks in machine learning and data mining. In the last decade, spectral clustering (e.g., Shi and Malik, 2000; Meila and Shi, 2000; Fowlkes et al., 2004), motivated by normalized graph cut, has attracted much attention. Unlike traditional partition-based clustering, spectral clustering exploits a pairwise data similarity matrix. It has been shown to be more effective than traditional methods such as k-means, which considers only the similarity between instances and k centroids (Ng, Jordan, and Weiss, 2001). Because of its effectiveness, spectral clustering has been widely used in several areas such as information retrieval and computer vision (e.g., Dhillon, 2001; Xu, Liu, and Gong, 2003; Shi and Malik, 2000; Yu and Shi, 2003).
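To make the pipeline described above concrete, the following is a minimal single-machine sketch of spectral clustering in the style of Ng, Jordan, and Weiss (2001): build a Gaussian similarity matrix, embed the points using the top eigenvectors of the symmetrically normalized similarity matrix, and run k-means in the embedded space. It is an illustration only, not the chapter's parallel algorithm; the function name, the `sigma` bandwidth, and the iteration count are our own choices, and the dataset is a toy example.

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0, iters=50):
    """Illustrative spectral clustering sketch (not the parallel version).

    X     : (n, d) data matrix
    k     : number of clusters
    sigma : Gaussian-kernel bandwidth (assumed parameter)
    """
    n = len(X)
    # Pairwise squared distances -> Gaussian similarity matrix W.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization: S = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Top-k eigenvectors of S form the low-dimensional spectral embedding.
    _, vecs = np.linalg.eigh(S)
    U = vecs[:, -k:]
    # Row-normalize the embedding before clustering.
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    # Deterministic farthest-point initialization for k-means centers.
    centers = U[[0]]
    for _ in range(1, k):
        d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        centers = np.vstack([centers, U[d2.argmax()]])
    # Lloyd's k-means iterations in the embedded space.
    for _ in range(iters):
        labels = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = U[labels == j].mean(axis=0)
    return labels

# Toy data: two well-separated groups of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
labels = spectral_clustering(X, k=2)
```

The O(n^2) similarity matrix and the dense eigendecomposition in this sketch are exactly the memory and computation bottlenecks the chapter's MapReduce/MPI parallelization addresses.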
Chapter information: Scaling Up Machine Learning: Parallel and Distributed Approaches, pp. 240-261. Publisher: Cambridge University Press. Print publication year: 2011.