Unsupervised modeling anomaly detection in discussion forums posts using global vectors for text representation

Paweł Cichosz

doi:10.1017/S1351324920000066

Published online by Cambridge University Press: 04 March 2020

Affiliation: Institute of Computer Science, Warsaw University of Technology, Warszawa, Poland. E-mail: [email protected]

Abstract

Anomaly detection can be seen as an unsupervised learning task in which a predictive model created on historical data is used to detect outlying instances in new data. This work addresses a potentially promising but relatively uncommon application of anomaly detection to text data. Two English-language and one Polish-language Internet discussion forums devoted to psychoactive substances obtained from home-grown plants, such as hashish or marijuana, serve as text sources that are both realistic and possibly interesting in their own right, due to potential associations with drug-related crime. The utility of two vector text representations is examined: the simple bag-of-words representation and the more refined Global Vectors (GloVe) representation, an example of the increasingly popular word embedding approach. Both are combined with two unsupervised anomaly detection methods, one based on one-class support vector machines (SVM) and one based on dissimilarity to k-medoids clusters. The GloVe representation is found to be clearly more useful for anomaly detection, permitting better detection quality and mitigating the curse-of-dimensionality issues that affect text clustering. The cluster dissimilarity approach combined with this representation outperforms one-class SVM with respect to detection quality and appears to be the more promising approach to anomaly detection in text data.
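
The cluster dissimilarity idea summarized in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the article's procedure: the three-dimensional TOY_VECTORS stand in for trained GloVe embeddings, posts are represented by averaging their word vectors (the abstract does not say how word vectors are combined), Euclidean distance is used as the dissimilarity, and a simple alternating k-medoids loop replaces whatever clustering implementation the author used. A post's anomaly score is taken to be its dissimilarity to the nearest medoid found on historical posts.

```python
import math

# Hypothetical 3-dimensional word vectors standing in for trained GloVe embeddings.
TOY_VECTORS = {
    "plant": [0.90, 0.10, 0.00],
    "grow":  [0.80, 0.20, 0.10],
    "seed":  [0.85, 0.15, 0.05],
    "light": [0.70, 0.30, 0.10],
    "price": [0.10, 0.90, 0.20],
    "sell":  [0.00, 0.95, 0.30],
}


def post_vector(tokens, vectors=TOY_VECTORS):
    """Represent a post as the mean of its known word vectors (assumed scheme)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("post contains no known words")
    return [sum(component) / len(known) for component in zip(*known)]


def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids with Euclidean dissimilarity; returns medoid points."""
    medoid_ids = list(range(k))  # deterministic start: the first k points
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoid_ids}
        for i, p in enumerate(points):
            nearest = min(medoid_ids, key=lambda m: math.dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: in each cluster, pick the member with the smallest
        # total dissimilarity to all members of that cluster.
        new_ids = []
        for m, members in clusters.items():
            if not members:  # keep the old medoid if its cluster is empty
                new_ids.append(m)
                continue
            best = min(
                members,
                key=lambda c: sum(math.dist(points[c], points[j]) for j in members),
            )
            new_ids.append(best)
        if sorted(new_ids) == sorted(medoid_ids):
            break  # medoids stopped changing
        medoid_ids = new_ids
    return [points[m] for m in medoid_ids]


def anomaly_score(vector, medoids):
    """Dissimilarity to the nearest medoid; larger means more anomalous."""
    return min(math.dist(vector, m) for m in medoids)


# "Historical" posts used to build the model (toy data).
historical = [
    ["plant", "grow", "seed"],
    ["grow", "light"],
    ["seed", "plant", "light"],
    ["grow", "seed"],
]
medoids = k_medoids([post_vector(p) for p in historical], k=2)

# New posts: one resembling the historical data, one on a different topic.
on_topic_score = anomaly_score(post_vector(["plant", "light", "grow"]), medoids)
off_topic_score = anomaly_score(post_vector(["price", "sell"]), medoids)
print(on_topic_score, off_topic_score)
```

In this toy setting the off-topic post lies far from both medoids and so receives the larger score. Turning scores into detections would still require a threshold or a ranking, which the sketch leaves out.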

Keywords

Text classification; Text clustering; Anomaly detection; Word embeddings

Type: Article
Information: Natural Language Engineering, Volume 26, Issue 5, September 2020, pp. 551–578

DOI: https://doi.org/10.1017/S1351324920000066
Copyright: © Cambridge University Press 2020
