Book contents
- Frontmatter
- Contents
- Contributors
- Preface
- 1 Scaling Up Machine Learning: Introduction
- Part One Frameworks for Scaling Up Machine Learning
- Part Two Supervised and Unsupervised Learning Algorithms
- 6 PSVM: Parallel Support Vector Machines with Incomplete Cholesky Factorization
- 7 Massive SVM Parallelization Using Hardware Accelerators
- 8 Large-Scale Learning to Rank Using Boosted Decision Trees
- 9 The Transform Regression Algorithm
- 10 Parallel Belief Propagation in Factor Graphs
- 11 Distributed Gibbs Sampling for Latent Variable Models
- 12 Large-Scale Spectral Clustering with MapReduce and MPI
- 13 Parallelizing Information-Theoretic Clustering Methods
- Part Three Alternative Learning Settings
- Part Four Applications
- Subject Index
- References
12 - Large-Scale Spectral Clustering with MapReduce and MPI
from Part Two - Supervised and Unsupervised Learning Algorithms
Published online by Cambridge University Press: 05 February 2012
Summary
Spectral clustering is a technique for finding group structure in data. It uses the spectrum of the data similarity matrix to perform dimensionality reduction, so that clustering can be carried out in a lower-dimensional space. Spectral clustering algorithms have been shown to find clusters more effectively than traditional algorithms such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computation time when the dataset is large. In this work, we parallelize both the memory use and the computation using MapReduce and MPI so that clustering can be performed on large datasets. Through an empirical study on a document set of 534,135 instances and a photo set of 2,121,863 images, we show that our parallel algorithm can effectively handle large problems.
Clustering is one of the most important tasks in machine learning and data mining. In the last decade, spectral clustering (e.g., Shi and Malik, 2000; Meila and Shi, 2000; Fowlkes et al., 2004), motivated by normalized graph cut, has attracted much attention. Unlike traditional partition-based clustering, spectral clustering exploits a pairwise data similarity matrix. It has been shown to be more effective than traditional methods such as k-means, which considers only the similarity between instances and k centroids (Ng, Jordan, and Weiss, 2001). Because of its effectiveness, spectral clustering has been widely used in several areas such as information retrieval and computer vision (e.g., Dhillon, 2001; Xu, Liu, and Gong, 2003; Shi and Malik, 2000; Yu and Shi, 2003).
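To make the pipeline described above concrete, the following is a minimal single-machine sketch of spectral clustering in the style of Ng, Jordan, and Weiss (2001): build a Gaussian similarity matrix, embed the points using the top eigenvectors of the symmetrically normalized similarity matrix, and run k-means in the embedded space. It is an illustration only, not the chapter's parallel algorithm; the function name, the `sigma` bandwidth, and the iteration count are our own choices, and the dataset is a toy example.

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0, iters=50):
    """Illustrative spectral clustering sketch (not the parallel version).

    X     : (n, d) data matrix
    k     : number of clusters
    sigma : Gaussian-kernel bandwidth (assumed parameter)
    """
    n = len(X)
    # Pairwise squared distances -> Gaussian similarity matrix W.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization: S = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Top-k eigenvectors of S form the low-dimensional spectral embedding.
    _, vecs = np.linalg.eigh(S)
    U = vecs[:, -k:]
    # Row-normalize the embedding before clustering.
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    # Deterministic farthest-point initialization for k-means centers.
    centers = U[[0]]
    for _ in range(1, k):
        d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        centers = np.vstack([centers, U[d2.argmax()]])
    # Lloyd's k-means iterations in the embedded space.
    for _ in range(iters):
        labels = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = U[labels == j].mean(axis=0)
    return labels

# Toy data: two well-separated groups of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
labels = spectral_clustering(X, k=2)
```

The O(n^2) similarity matrix and the dense eigendecomposition in this sketch are exactly the memory and computation bottlenecks the chapter's MapReduce/MPI parallelization addresses.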
Chapter information: Scaling Up Machine Learning: Parallel and Distributed Approaches, pp. 240-261. Publisher: Cambridge University Press. Print publication year: 2011.