
Unsupervised modeling anomaly detection in discussion forums posts using global vectors for text representation

Published online by Cambridge University Press:  04 March 2020

Paweł Cichosz*
Affiliation:
Institute of Computer Science, Warsaw University of Technology, Warszawa, Poland

Abstract

Anomaly detection can be seen as an unsupervised learning task in which a predictive model created on historical data is used to detect outlying instances in new data. This work addresses a possibly promising but relatively uncommon application of anomaly detection to text data. Two English-language and one Polish-language Internet discussion forums devoted to psychoactive substances obtained from home-grown plants, such as hashish or marijuana, serve as text sources that are both realistic and possibly interesting in their own right, due to potential associations with drug-related crime. The utility of two different vector text representations is examined: the simple bag-of-words representation and the more refined Global Vectors (GloVe) representation, which is an example of the increasingly popular word embedding approach. Both are combined with two unsupervised anomaly detection methods: one based on one-class support vector machines (SVM) and one based on dissimilarity to k-medoids clusters. The GloVe representation is found to be clearly more useful for anomaly detection, permitting better detection quality and ameliorating the curse-of-dimensionality issues of text clustering. The cluster dissimilarity approach combined with this representation outperforms one-class SVM with respect to detection quality and appears to be a more promising approach to anomaly detection in text data.
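
To make the cluster dissimilarity idea concrete, the following is a minimal sketch and not the article's implementation. It assumes that a post is represented by the average of its word vectors, that dissimilarity is measured as one minus cosine similarity, and that the anomaly score of a new post is its dissimilarity to the nearest medoid. The tiny three-dimensional `WORD_VECTORS` table is a hypothetical stand-in for pretrained GloVe vectors, the token lists are invented toy data, and the clustering routine is a simplified alternating k-medoids variant rather than any specific published algorithm.

```python
import math
import random

# Hypothetical 3-d "embeddings" standing in for pretrained GloVe vectors.
WORD_VECTORS = {
    "plant": (1.0, 0.1, 0.0), "grow": (0.9, 0.2, 0.1),
    "seed": (0.95, 0.0, 0.05), "light": (0.8, 0.3, 0.0),
    "price": (0.1, 1.0, 0.0), "sell": (0.0, 0.9, 0.1),
    "buy": (0.2, 0.95, 0.05),
    "football": (0.0, 0.1, 1.0), "match": (0.1, 0.0, 0.9),
}


def doc_vector(tokens, word_vectors):
    """Average the vectors of known tokens (assumed document representation)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None  # no known token, so no representation
    dim = len(known[0])
    return tuple(sum(v[i] for v in known) / len(known) for i in range(dim))


def cosine_dissimilarity(a, b):
    """One minus cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def k_medoids(points, k, dissim, iterations=20, seed=0):
    """Simplified alternating k-medoids: assign points, then re-pick medoids."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(iterations):
        # Assignment step: each point joins its least dissimilar medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dissim(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes total within-cluster dissimilarity.
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)  # keep the medoid of an empty cluster
                continue
            best = min(
                members,
                key=lambda c: sum(dissim(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return [points[m] for m in medoids]


def anomaly_score(vector, medoids, dissim):
    """Dissimilarity to the nearest medoid; higher means more anomalous."""
    return min(dissim(vector, m) for m in medoids)


# Toy "historical" posts used to build the clustering model.
train_docs = [
    ["grow", "plant", "seed"], ["plant", "light"], ["seed", "grow", "light"],
    ["buy", "price"], ["sell", "price"], ["buy", "sell", "price"],
]
train_vectors = [doc_vector(d, WORD_VECTORS) for d in train_docs]
medoids = k_medoids(train_vectors, k=2, dissim=cosine_dissimilarity)

# Score two new posts: one on-topic, one off-topic.
on_score = anomaly_score(
    doc_vector(["plant", "seed", "light"], WORD_VECTORS), medoids, cosine_dissimilarity
)
off_score = anomaly_score(
    doc_vector(["football", "match"], WORD_VECTORS), medoids, cosine_dissimilarity
)
print(f"on-topic score:  {on_score:.3f}")
print(f"off-topic score: {off_score:.3f}")
```

In this toy setting the off-topic post receives a much higher score than the on-topic one, because it lies far from both medoids. Turning scores into detections would additionally require a threshold or a ranking-based evaluation, which this sketch leaves out.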

Type
Article
Copyright
© Cambridge University Press 2020

