
Term evaluation metrics in imbalanced text categorization

Published online by Cambridge University Press: 12 July 2019

Behzad Naderalvojoud*
Affiliation:
Department of Computer Engineering, Hacettepe University, 06800, Ankara, Turkey
Ebru Akcapinar Sezer
Affiliation:
Department of Computer Engineering, Hacettepe University, 06800, Ankara, Turkey
*Corresponding author. Emails: [email protected], [email protected]

Abstract

This paper proposes four novel term evaluation metrics for representing documents in text categorization when the class distribution is imbalanced. These metrics are derived by revising four common term evaluation metrics: chi-square, information gain, odds ratio, and relevance frequency. While the common metrics require a balanced class distribution, the proposed metrics evaluate document terms under an imbalanced distribution. They calculate the degree of relatedness of terms to the minor and major classes by considering the imbalanced class distribution. Using these metrics in document representation better distinguishes the documents of the minor and major classes and improves the performance of machine learning algorithms. The proposed metrics are assessed on three popular benchmarks (two subsets of Reuters-21578, and WebKB) using four classification algorithms: support vector machines, naive Bayes, decision trees, and centroid-based classifiers. Our empirical results indicate that the proposed metrics outperform the common metrics in imbalanced text categorization.
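For orientation, the four common metrics named in the abstract are usually computed from a two-by-two contingency table of term occurrence against class membership. The sketch below uses widely cited textbook forms of chi-square, information gain, odds ratio, and relevance frequency; it does not implement the revised metrics proposed in this paper, which are not given on this page. The function name `term_metrics`, the count names `a`, `b`, `c`, `d`, and the guards against zero counts are assumptions made for illustration.

```python
import math


def term_metrics(a, b, c, d):
    """Common term evaluation metrics from a 2x2 contingency table.

    Assumed notation (illustrative, not taken from the paper):
      a: positive-class documents that contain the term
      b: positive-class documents that do not contain the term
      c: negative-class documents that contain the term
      d: negative-class documents that do not contain the term
    """
    n = a + b + c + d
    pos, neg = a + b, c + d          # class totals
    present, absent = a + c, b + d   # term-occurrence totals

    # Chi-square statistic of independence between term and class.
    denom = pos * neg * present * absent
    chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0

    # Information gain, written as the mutual information between
    # term occurrence and class; empty cells contribute zero.
    def cell(x, row, col):
        return (x / n) * math.log2(x * n / (row * col)) if x else 0.0

    ig = (cell(a, pos, present) + cell(b, pos, absent)
          + cell(c, neg, present) + cell(d, neg, absent))

    # Odds ratio; the max(1, ...) guard against a zero denominator
    # is an assumption of this sketch.
    odds_ratio = (a * d) / max(1, b * c)

    # Relevance frequency in the commonly cited form log2(2 + a / max(1, c)).
    rf = math.log2(2 + a / max(1, c))

    return {"chi2": chi2, "ig": ig, "odds_ratio": odds_ratio, "rf": rf}


if __name__ == "__main__":
    # Hypothetical counts: a small positive class and a large negative class.
    print(term_metrics(a=40, b=10, c=20, d=930))
```

In this sketch a term that is statistically independent of the class scores zero on chi-square and information gain. The counts in the usage line are hypothetical and only illustrate a skewed class distribution.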

Type
Article
Copyright
© Cambridge University Press 2019 
