Hostname: page-component-78c5997874-ndw9j Total loading time: 0 Render date: 2024-11-06T10:01:54.475Z Has data issue: false hasContentIssue false

Silhouette + attraction: A simple and effective method for text clustering

Published online by Cambridge University Press:  14 August 2015

MARCELO L. ERRECALDE
Affiliation:
LIDIC, Universidad Nacional de San Luis, San Luis, Argentina e-mail: [email protected], [email protected]
LETICIA C. CAGNINA
Affiliation:
LIDIC, Universidad Nacional de San Luis, San Luis, Argentina e-mail: [email protected], [email protected]
PAOLO ROSSO
Affiliation:
NLE Lab, PRHLT Research Center, Universitat Politècnica de València, Valencia, España e-mail: [email protected]

Abstract

This article presents silhouette–attraction (Sil–Att), a simple and effective method for text clustering, which is based on two main concepts: the silhouette coefficient and the idea of attraction. The combination of both principles allows us to obtain a general technique that can be used either as a boosting method, which improves results of other clustering algorithms, or as an independent clustering algorithm. The experimental work shows that Sil–Att is able to obtain high-quality results on text corpora with very different characteristics. Furthermore, its stable performance on all the considered corpora is indicative that it is a very robust method. This is a very interesting positive aspect of Sil–Att with respect to the other algorithms used in the experiments, whose performances heavily depend on specific characteristics of the corpora being considered.

Type
Articles
Copyright
Copyright © Cambridge University Press 2015 

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Footnotes

This research work has been partially funded by UNSL, CONICET (Argentina), DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) research project, and the WIQ-EI IRSES project (grant no. 269180) within the FP 7 Marie Curie People Framework on Web Information Quality Evaluation Initiative. The work of the third author was done also in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.

References

Alexandrov, M., Gelbukh, A., and Rosso, P. 2005. An approach to clustering abstracts. In Montoyo, A., Muñoz, R., and Métais, E. (eds.), Natural Language Processing and Information Systems, vol. 3513, pp. 275–85. Lecture Notes in Computer Science. Germany: Springer-Verlag Berlin Heidelberg,.Google Scholar
Aranganayagi, S., and Thangavel, K., 2007. Clustering categorical data using silhouette coefficient as a relocating measure. In Proceedings of the International Conference on Computational Intelligence and Multimedia Applications (ICCIMA-2007), vol. 02, California, USA: IEEE Computer Society, pp. 13–7.Google Scholar
Azzag, H., Monmarche, N., Slimane, M., Venturini, G., and Guinot, C., 2003. AntTree: a new model for clustering with artificial ants. In Proceedings of the IEEE Conference on Evolutionary Computation (CEC-2003), vol. 04, New Jersey, USA: IEEE Press, pp. 2642–7.Google Scholar
Banerjee, S., Ramanathan, K., and Gupta, A., 2007. Clustering short texts using Wikipedia. In Proceedings of the 30th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR-2007), New York, USA: ACM, pp. 787–8.Google Scholar
Barnette, J. J., and McLean, J. E., 1998. The Tukey honestly significant difference procedure and its control of the type I error-rate. In Proceedings of 27 Annual Meeting of the Mid-South Educational Research Association, John Petry. Louisiana, USA, pp. 215.Google Scholar
Berry, M. W. (ed.), 2003. Survey of Text Mining: Clustering, Classification, and Retrieval. New York, USA: Springer Science+Business Media, Inc.Google Scholar
Bezdek, J. C., and Pal, N. R., 1995. Cluster validation with generalized Dunn’s indices. In Proceedings of the Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems, California, USA: IEEE Computer Society Press, pp. 190–3.Google Scholar
Bonato dos Santos, J., Heuser, C., Pereira Moreira, V., and Krug Wives, L., 2011. Automatic threshold estimation for data matching applications. Information Sciences 181 (13): 2685–99.Google Scholar
Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., and Dougherty, E. R., 2007. Model-based evaluation of clustering validation measures. Pattern Recognition 40 (3): 807–24.CrossRefGoogle Scholar
Cagnina, L., Errecalde, M., Ingaramo, D., and Rosso, P., 2008. A discrete particle swarm optimizer for clustering short-text corpora. In Proceedings of the International Conference on Bioinspired Optimization Methods and their Applications (BIOMA08), Ljubljana, Slovenia: Jozef Stefan Institute, pp. 93103.Google Scholar
Cagnina, L., Errecalde, M., Ingaramo, D., and Rosso, P., 2014. An efficient particle swarm optimization approach to cluster short texts. Information Sciences 265 : 3649.CrossRefGoogle Scholar
Carullo, M., Binaghi, E., and Gallo, I., 2009. An online document clustering technique for short web contents. Pattern Recognition Letters 30 (10): 870–6.Google Scholar
Charikar, M., Chekuri, C., Feder, T., and Motwani, R., 2004. Incremental clustering and dynamic information retrieval. SIAM Journal on Computing 33 (6): 1417–40.Google Scholar
Choi, M. J., Tan, V. Y. F., Anandkumar, A., and Willsky, A. S., 2011. Learning latent tree graphical models. Journal of Machine Learning Research 12 : 1771–812.Google Scholar
Cutting, D. R., Karger, D. R., Pedersen, J. O., and Tukey, J. W., 1992. Scatter/gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2002), Interface Design and Display, New York, USA: ACM, pp. 318–29.Google Scholar
Davies, D. L., and Bouldin, D. W., 1979. A cluster separation measure. IEEE Transactions on Analysis and Machine Intelligence 1 (2): 224–7.Google Scholar
Dunn, J. C., 1974. Well separated clusters and optimal fuzzy-partitions. Journal of Cybernetics 4 (1): 95104.Google Scholar
Errecalde, M., Ingaramo, D., and Rosso, P., 2008. Proximity estimation and hardness of short-text corpora. In 19th International Workshop on Database and Expert Systems Application, California, USA: IEEE Computer Society, pp. 15–9.Google Scholar
Errecalde, M., Ingaramo, D., and Rosso, P., 2010. ITSA*: an effective iterative method for short-text clustering tasks. In Proceedings of the 23rd International Conference on Industrial Engineering and other Applications of Applied Intelligent Systems, IEA/AIE 2010, vol. 6096, Lecture Notes in Computer Science, Germany: Springer-Verlag Berlin, Heidelberg., pp. 550–9.Google Scholar
Fisher, R., 1925. Statistical Methods for Research Workers. Edinburgh, Scotland: Oliver & Boyd.Google Scholar
Hasan, M. A., Chaoji, V., Salem, S., and Zaki, M. J., 2009. Robust partitional clustering by outlier and density insensitive seeding. Pattern Recognition Letters 30 (11): 9941002.Google Scholar
He, H., Chen, B., Xu, W., and Guo, J., 2007. Short text feature extraction and clustering for web topic mining. In Proceedings of the 3rd International Conference on Semantics, Knowledge and Grid, California, USA: IEEE Computer Society, pp. 382–5.Google Scholar
Hearst, M. A., 2006. Clustering versus faceted categories for information exploration. Communications of the ACM 49 (4): 5961.Google Scholar
Hu, X., Sun, N., Zhang, C., and Chua, T., 2009. Exploiting internal and external semantics for the clustering of short texts using world knowledge. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, New York, USA: ACM, pp. 919–28.CrossRefGoogle Scholar
Ingaramo, D., Errecalde, M., Cagnina, L., and Rosso, P. 2009. Particle Swarm Optimization for clustering short-text corpora. In Computational Intelligence and Bioengineering, pp. 319. Amsterdam, The Netherlands: IOS Press.Google Scholar
Ingaramo, D., Errecalde, M., and Rosso, P. 2010a. A general bio-inspired method to improve the short-text clustering task. In Proceedings of 11th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2010, vol. 6008, pp. 661–72. Lecture Notes in Computer Science. Germany: Springer-Verlag Berlin Heidelberg.Google Scholar
Ingaramo, D., Errecalde, M., and Rosso, P., 2010b. A new AntTree-based algorithm for clustering short-text corpora. Journal of Computer Science & Technology 10 (1): 17.Google Scholar
Ingaramo, D., Leguizamón, G., and Errecalde, M., 2005. Adaptive clustering with artificial ants. Journal of Computer Science & Technology 5 (4): 264–71.Google Scholar
Ingaramo, D., Pinto, D., Rosso, P., and Errecalde, M. 2008. Evaluation of internal validity measures in short-text corpora. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2008, vol. 4919, pp. 555–67. Lecture Notes in Computer Science. Germany: Springer-Verlag Berlin Heidelberg.Google Scholar
Ingaramo, D., Rosas, M. V., Errecalde, M., and Rosso, P., 2011. Clustering iterativo de textos cortos con representaciones basadas en concepto. Procesamiento del Lenguaje Natural 46 : 1926.Google Scholar
Jiang, D., Pei, J., and Zhang, A., 2003. DHC: a density-based hierarchical clustering method for time series gene expression data. In Proceedings 3rd IEEE Symposium on BioInformatics and BioEngineering (BIBE 2003), California, USA: IEEE Computer Society, pp. 393400.Google Scholar
Jing, L. 2005. Survey of text clustering. Technical report. Department of Mathematics. The University of Hong Kong, Hong Kong, China.Google Scholar
Karthikeyan, T., Peter, S. J., and Chidambaranathan, S., 2011. Hybrid algorithm for noise-free high density clusters with self-detection of best number of clusters. International Journal of Hybrid Information Technology 4 (2): 3954.Google Scholar
Karypis, G., Han, E., and Kumar, V., 1999. Chameleon: hierarchical clustering using dynamic modeling. IEEE Computer 32 (8): 6875.Google Scholar
Kaufman, L., and Rousseeuw, P. J. 1990. Finding Groups in Data: An Introduction to Cluster Analysis, New York, USA: John Wiley and Sons.Google Scholar
Kruskal, W. H., and Wallis, W. A., 1952. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 47 (260): 583621.Google Scholar
Kyriakopoulou, A. 2008. Text classification aided by clustering: a literature review. In Tools in Artificial Intelligence, pp. 233–52. Rijeka, Croatia: INTECH.Google Scholar
Larocca Neto, J., Santos, A. D., Kaestner, C. A. A., and Freitas, A. A. 2000. Document clustering and text summarization. In Mackin, N. (ed.), Proceedings of the 4th International Conference Practical Applications of Knowledge Discovery and Data Mining (PADD-2000), London, England: The Practical Application Company, pp. 4155.Google Scholar
Levene, H. 1960. Robust tests for equality of variances. In Contributions to Probability and Statistics, pp. 278–92. California, USA: Stanford University Press.Google Scholar
Liu, X., and Croft, W. B. 2004. Cluster-based retrieval using language models. In Sanderson, M., Jrvelin, K., Allan, J., and Bruza, P. (eds.), Proceedings of the 27th annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2004), New York, USA: ACM, pp. 186–93.Google Scholar
MacQueen, J. B., 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, California, USA: University of California Press, pp. 281–97.Google Scholar
Makagonov, P., Alexandrov, M., and Gelbukh, A. 2004. Clustering abstracts instead of full texts. In Text, Speech and Dialogue, 10th International Conference (TSD 2004), vol. 3206, pp. 129–35. Lecture Notes in Artificial Intelligence, Germany: Springer-Verlag Berlin Heidelberg.Google Scholar
Manning, C. D., Raghavan, P., and Schutze, H., 2008. Introduction to Information Retrieval. Cambridge, England: Cambridge University Press.Google Scholar
Ng, A. Y., Jordan, M. I., and Weiss, Y. 2001. On spectral clustering: analysis and an algorithm. In The Proceedings of the 2001 Neural Information Processing Systems (NIPS) Conference. vol. 14, pp. 849–56. Advances in Neural Information Processing Systems. Massachusetts, USA: MIT Press.Google Scholar
Ng, R. T., and Han, J., 1994. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB ’94), California, USA: Morgan Kaufmann Publishers Inc., pp. 144–55.Google Scholar
Paterlini, A. A., Nascimento, M. A., and Traina, C. Jr, 2011. Using pivots to speed-up k-medoids clustering. Journal of Information and Data Management 2 (2): 221–36.Google Scholar
Pinto, D., Benedí, J. M., and Rosso, P. 2007. Clustering narrow-domain short texts by using the Kullback-Leibler distance. In Computational Linguistics and Intelligent Text Processing, 8th International Conference (CICLing 2007), vol. 4394, pp. 611–22. Lecture Notes in Computer Science, Germany: Springer-Verlag Berlin Heidelberg.Google Scholar
Pinto, D., and Rosso, P. 2007. On the relative hardness of clustering corpora. In Text, Speech and Dialogue, 10th International Conference (TSD 2007), vol. 4629, pp. 155–61. Lecture Notes in Computer Science, Germany: Springer-Verlag Berlin Heidelberg.CrossRefGoogle Scholar
Pons-Porrata, A., Berlanga-Llavori, R., and Ruiz-Shulcloper, J., 2007. Topic discovery based on text mining techniques. Information Processing and Management 43 (3): 752–68.Google Scholar
Popova, S., and Khodyrev, I. 2011. Local theme detection and annotation with keywords for narrow and wide domain short text collections. In Proceedings of the 5th International Conference on Advances in Semantic Processing (SEMAPRO 2011). Delaware, USA: International Academy, Reseach, and Industry Association (IARIA), pp. 4955.Google Scholar
Qi, G.-J., Aggarwal, C., and Huang, T., 2012. Community detection with edge content in social media networks. In IEEE 28th International Conference on Data Engineering (ICDE 2012), California, USA: IEEE Computer Society - Conference Publishing Services (CPS), pp. 534–45.Google Scholar
Rousseeuw, P., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20 (1): 5365.Google Scholar
Selim, S. Z., and Alsultan, K., 1991. A simulated annealing algorithm for the clustering problem. Pattern Recognition 24 (10): 1003–8.Google Scholar
Shannon, C. E., 1948. A mathematical theory of communication. The Bell system technical journal 27 : 379423.CrossRefGoogle Scholar
Slonim, N., Friedman, N., and Tishby, N., 2002. Unsupervised document classification using sequential information maximization. In Proceedings of the 25th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2002), New York, USA: ACM, pp. 129–36.Google Scholar
Spearman, C., 1904. The proof and measurement of association between two things. The American Journal of Psychology 15 (1): 72101.Google Scholar
Stein, B., and Busch, M., 2005. Density-based cluster algorithms in low-dimensional and high-dimensional applications. In 2nd International Workshop on Text-Based Information Retrieval (TIR 2005), Fachberichte Informatik, Mainz, Germany: University of Koblenz-Landau, pp. 4556.Google Scholar
Stein, B., and Meyer zu Eissen, S. 2002. Document categorization with MajorClust. In Proceedings Workshop on Issues in the Theory of Security (WITS 2002), Barcelona, Spain: Technical University of Barcelona, pp. 91–6.Google Scholar
Stein, B., and Meyer zu Eissen, S., 2004. Topic identification: framework and application. In Proceedings of the International Conference on Knowledge Management (I-KNOW 2004), Graz, Austria: Know-Center, pp. 353–60.Google Scholar
Stein, B., Meyer zu Eissen, S., and Wißbrock, F., 2003. On cluster validity and the information need of users. In 3rd International Conference on Artificial Intelligence and Applications (AIA 2003), Benalmádena, Spain: ACTA Press, pp. 216–21.Google Scholar
Steinbach, M., Karypis, G., and Kumar, V., 2000. A comparison of document clustering techniques. In Working Notes - Workshop on Text Mining. The 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Massachusetts, USA, pp. 109–11.Google Scholar
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., and Varga, D., 2006. The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Paris, France: European Language Resources Association (ELRA), pp. 2142–7.Google Scholar
Takeda, T., and Takasu, A., 2007. Updatenews: a news clustering and summarization system using efficient text processing. In Proceedings of the 7th ACM/IEEE CS Joint Conference on Digital Libraries, New York, USA: ACM, pp. 438–9.Google Scholar
Tan, P. N., Steinbach, M., and Kumar, V., 2005. Introduction to Data Mining. Massachusetts, USA: Addison-Wesley Longman Publishing Co..Google Scholar
Tu, L., and Chen, Y., 2009. Stream data clustering based on grid density and attraction. ACM Transactions on Knowledge Discovery from Data 3 : 127.Google Scholar
Tukey, J. W. 1953. The problem of multiple comparisons. In The Collected Works of John W. Tukey VIII. Multiple Comparisons: 1948–1983, pp. 1300. New York, USA: Chapman and Hall.Google Scholar
Tukey, J. W., 1977. Exploratory Data Analysis. Massachusetts, USA: Pearson Education, Inc.Google Scholar
van Rijsbergen, C. J. 1979. Information Retrieval, 2 edn. London, England: Butterworths.Google Scholar
Xu, W., Liu, X., and Gong, Y. 2003. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2003). New York, USA: ACM, pp. 267–73.Google Scholar
Yang, T., Jin, R., Chi, Y., and Zhu, S., 2009. Combining link and content for community detection: a discriminative approach. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09. New York, USA: ACM, pp. 927–36.Google Scholar
Zha, H., He, X., Ding, C., Simon, H., and Gu, M. 2001. Spectral relaxation for k-means clustering. In The Proceedings of the 2001 Neural Information Processing Systems (NIPS) Conference, vol. 14, pp. 1057–64. Advances in Neural Information Processing Systems. Massachusetts, USA: MIT Press.Google Scholar
Zhang, D., Wang, J., and Si, L., 2011. Document clustering with universum. In Proceedings of the 34th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2011), New York, USA: ACM, pp. 873–82.Google Scholar
Zhou, Y., Cheng, H., and Yu, J. X., 2009. Graph clustering based on structural/attribute similarities. In Proceedings of the VLDB Endowment, vol. 2. Very Large Data Base Endowment Inc (VLDB Endowment), Lyon, France, pp. 718–29.Google Scholar
Zhao, Y., and Karypis, G., 2004. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning 55 : 311–31.Google Scholar
Zhao, Y., Karypis, G., and Fayyad, U., 2005. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery 10 : 141–68.Google Scholar