
Discretization as the enabling technique for the Naïve Bayes and semi-Naïve Bayes-based classification

Published online by Cambridge University Press: 01 December 2010

Marcin J. Mizianty
Affiliation: Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada; e-mail: [email protected], [email protected]
Lukasz A. Kurgan
Affiliation: Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada; e-mail: [email protected], [email protected]
Marek R. Ogiela
Affiliation: Bio-cybernetics Laboratory, Institute of Automatics, AGH University of Science and Technology, Krakow, Poland; e-mail: [email protected]

Abstract

Current classification problems that concern data sets of large and increasing size require scalable classification algorithms. In this study, we concentrate on several scalable, linear-complexity classifiers that include one of the top 10 voted data mining methods, Naïve Bayes (NB), and several recently proposed semi-NB classifiers. These algorithms perform front-end discretization of the continuous features since by design they work only with nominal or discrete features. We address the lack of studies that investigate the benefits and drawbacks of discretization in the context of the subsequent classification. Our comprehensive empirical study considers 12 discretizers (two unsupervised and 10 supervised), seven classifiers (two classical NB and five semi-NB), and 16 data sets. We investigate the scalability of the discretizers and show that the fastest supervised discretizers, fast class-attribute interdependency maximization (FCAIM), class-attribute interdependency maximization (CAIM), and information entropy maximization (IEM), provide discretization schemes with the highest overall quality. We show that discretization improves the classification accuracy when compared against the two classical methods, NB and Flexible Naïve Bayes (FNB), executed on the raw data. The choice of the discretization algorithm affects the significance of these improvements. The MODL, FCAIM, and CAIM methods provide statistically significant improvements, while the IEM, class-attribute contingency coefficient (CACC), and Khiops discretizers provide moderate improvements. The most accurate classification models are generated by the averaged one-dependence estimators with subsumption resolution (AODEsr) classifier, followed by AODE and hidden Naïve Bayes (HNB). AODEsr run on data discretized with MODL, FCAIM, and CAIM provides statistically significantly better accuracies than both classical NB methods. The worst results are obtained with the NB, FNB, and lazy Bayesian rules (LBR) classifiers. We show that although the time to build the discretization scheme can be longer than the time to train the classifier, the entire process (discretizing the data, computing the classifier, and predicting test instances) often completes faster than NB-based classification of the continuous instances. This is because the time to classify test instances is an important factor that discretization reduces. The biggest positive influence, on both the accuracy and the classification time, is associated with the MODL, FCAIM, and CAIM algorithms.
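
To make the pipeline discussed above concrete, the sketch below illustrates the discretize-then-classify workflow for an NB classifier: continuous features are binned by a front-end discretizer, and the classifier is then trained and tested on the resulting nominal features. This is a minimal illustration assuming scikit-learn is available; equal-frequency binning is used only as a stand-in, since the supervised discretizers compared in the study (e.g. CAIM, FCAIM, IEM, MODL) are not shipped with that library.

# Minimal sketch of the discretize-then-classify pipeline (hypothetical example,
# not the experimental setup of the article). Unsupervised equal-frequency
# binning stands in for the supervised discretizers studied in the paper.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: NB with a Gaussian density estimate on the raw continuous features.
gnb = GaussianNB().fit(X_train, y_train)
print("NB on continuous features:", gnb.score(X_test, y_test))

# Front-end discretization: the binning is learned on the training split only
# and then applied to both the training and the test instances.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_train_d = disc.fit_transform(X_train).astype(int)
X_test_d = disc.transform(X_test).astype(int)

# NB on the discretized (nominal) features; min_categories guards against
# bin indices that appear only in the test split.
cnb = CategoricalNB(min_categories=5).fit(X_train_d, y_train)
print("NB on discretized features:", cnb.score(X_test_d, y_test))

Both variants share the same train/test split, mirroring the comparison of NB on raw versus discretized data described in the abstract; in the study itself, a supervised discretizer would replace the binning step.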

Type: Articles
Copyright: © Cambridge University Press 2010


References

Abraham, R., Simha, J. B., Iyengar, S. S. 2006. A comparative analysis of discretization methods for medical datamining with naïve bayesian classifier. In ICIT '06: Proceedings of the 9th International Conference on Information Technology. IEEE Computer Society, Washington, DC, USA, 235–236.
Alpaydin, E. 1999. Combined 5×2 cv F test for comparing supervised classification learning algorithms. Neural Computation 11(8), 1885–1892.
Asuncion, A., Newman, D. 2007. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/MLRepository.html.
Boullé, M. 2004. Khiops: a statistical discretization method of continuous attributes. Machine Learning 55(1), 53–69.
Boullé, M. 2006. MODL: a bayes optimal discretization method for continuous attributes. Machine Learning 65(1), 131–165.
Catlett, J. 1991. On changing continuous attributes into ordered discrete attributes. In EWSL '91: Proceedings of the European Working Session on Machine Learning. Springer-Verlag, London, UK, 164–178.
Ching, J., Wong, A., Chan, K. 1995. Class-dependent discretization for inductive learning from continuous and mixed mode data. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(7), 641–651.
Cios, K. J., Kurgan, L. A. 2002. Hybrid Inductive Machine Learning: An Overview of CLIP Algorithms. Physica-Verlag GmbH.
Cios, K. J., Kurgan, L. A. 2004. CLIP4: hybrid inductive machine learning algorithm that generates inequality rules. Information Sciences 163(1–3), 37–83.
Clark, P., Niblett, T. 1989. The CN2 induction algorithm. Machine Learning 3(4), 261–283.
Demšar, J. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30.
Dougherty, J., Kohavi, R., Sahami, M. 1995. Supervised and unsupervised discretization of continuous features. In Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann, 194–202.
Fayyad, U. M., Irani, K. B. 1992. On the handling of continuous-valued attributes in decision tree generation. Machine Learning 8(1), 87–102.
Fayyad, U. M., Irani, K. B. 1993. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the International Joint Conference on Uncertainty in AI. Morgan Kaufmann, San Francisco, CA, USA, 1022–1027.
Flores, J. L., Inza, I., Larrañaga, P. 2007. Wrapper discretization by means of estimation of distribution algorithms. Intelligent Data Analysis 11(5), 525–545.
Friedman, N., Geiger, D., Goldszmidt, M., Provan, G., Langley, P., Smyth, P. 1997. Bayesian network classifiers. Machine Learning 29, 131–163.
Huang, W. 1996. Discretization of Continuous Attributes for Inductive Machine Learning. MSc thesis, Department of Computer Science, University of Toledo, Ohio, USA.
Iman, R. L., Davenport, J. M. 1980. Approximations of the critical region of the Friedman statistic. Communications in Statistics A9, 571–595.
Jiang, L., Zhang, H. 2006. Weightily averaged one-dependence estimators. In Proceedings of the 9th Biennial Pacific Rim International Conference on Artificial Intelligence. Morgan Kaufmann, 970–974.
John, G., Langley, P. 1995. Estimating continuous distributions in bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 338–345.
Kaufman, K. A., Michalski, R. S. 1999. Learning from inconsistent and noisy data: the AQ18 approach. In Proceedings of the 11th International Symposium on Methodologies for Intelligent Systems, Saratoga Springs, NY, May 2005.
Keogh, E. J., Pazzani, M. J. 1999. Learning augmented bayesian classifiers: a comparison of distribution-based and classification-based approaches. In Proceedings of the Seventh International Workshop on Artificial Intelligence and Statistics, Fort Lauderdale, Florida, USA, 3–6 January, 225–230.
Kerber, R. 1992. ChiMerge: discretization of numeric attributes. In Proceedings of the 9th International Conference of Artificial Intelligence, Cambridge, UK, 20–22 February, 123–128.
Kohavi, R., Sahami, M. 1996. Error-based and entropy-based discretization of continuous features. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, USA, 2–4 August, 114–119.
Kujala, J., Elomaa, T. 2007. Improved algorithms for univariate discretization of continuous features. In PKDD 2007: Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, Warsaw, Poland, 17–21 September, 188–199.
Kurgan, L. A., Cios, K. J. 2001. Discretization algorithm that uses class-attribute interdependence maximization. In Proceedings of the 2001 International Conference on Artificial Intelligence, Seattle, Washington, USA, 4–10 August, 980–987.
Kurgan, L. A., Cios, K. J. 2003. Fast class-attribute interdependence maximization (CAIM) discretization algorithm. In Proceedings of the International Conference on Machine Learning and Applications, Los Angeles, California, USA, 23–24 June, 30–36.
Kurgan, L. A., Cios, K. J. 2004. CAIM discretization algorithm. IEEE Transactions on Knowledge and Data Engineering 16, 145–153.
Kurgan, L. A., Cios, K. J., Dick, S. 2006. Highly scalable and robust rule learner: performance evaluation and comparison. IEEE Transactions on Systems, Man, and Cybernetics, Part B 36(1), 32–53.
Langley, P., Iba, W., Thompson, K. 1992. An analysis of bayesian classifiers. In Proceedings of the Tenth Conference on Artificial Intelligence. MIT Press, 223–228.
Langley, P., Sage, S. 1994. Induction of selective bayesian classifiers. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 399–406.
Lee, C.-H. 2007. A Hellinger-based discretization method for numeric attributes in classification learning. Knowledge-Based Systems 20(4), 419–425.
Liu, H., Setiono, R. 1997. Feature selection via discretization. IEEE Transactions on Knowledge and Data Engineering 9, 642–645.
Liu, H., Hussain, F., Tan, C., Dash, M. 2002. Discretization: an enabling technique. Data Mining and Knowledge Discovery 6(4), 393–423.
Liu, X., Wang, H. 2005. A discretization algorithm based on a heterogeneity criterion. IEEE Transactions on Knowledge and Data Engineering 17(9), 1166–1173.
Mehta, S., Parthasarathy, S., Yang, H. 2005. Toward unsupervised correlation preserving discretization. IEEE Transactions on Knowledge and Data Engineering 17(9), 1174–1185.
Mizianty, M. J., Kurgan, L. A., Ogiela, M. R. 2008. Comparative analysis of the impact of discretization on the classification with naïve bayes and semi-naïve bayes classifiers. In ICMLA '08: Proceedings of the 2008 Seventh International Conference on Machine Learning and Applications, San Diego, California, USA, 11–13 December, 823–828.
Nemenyi, P. 1963. Distribution-free Multiple Comparisons. PhD thesis, Princeton University.
Paterson, A., Niblett, T. 1987. ACLS Manual. Intelligent Terminals, Ltd.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc.
Rissanen, J. 1978. Modeling by shortest data description. Automatica 14, 465–471.
Tay, F. E. H., Shen, L. 2002. A modified chi2 algorithm for discretization. IEEE Transactions on Knowledge and Data Engineering 14(3), 666–670.
Tsai, C.-J., Lee, C.-I., Yang, W.-P. 2008. A discretization algorithm based on class-attribute contingency coefficient. Information Sciences 178(3), 714–731.
Wang, K., Liu, B. 1998. Concurrent discretization of multiple attributes. In Proceedings of the 5th Pacific Rim International Conference on Artificial Intelligence, Singapore, 22–27 November, 250–259.
Wang, Z., Webb, G. I. 2002. Comparison of lazy bayesian rule and tree-augmented bayesian learning. In Proceedings of the 2002 IEEE International Conference on Data Mining. IEEE Computer Society, Washington, DC, USA, 490.
Webb, G. I., Boughton, J. R., Wang, Z. 2005. Not so naïve bayes: aggregating one-dependence estimators. Machine Learning 58, 5–24.
Winter, R., Auerbach, K. 2004. Contents under Pressure. Intelligent Enterprise. http://www.intelligententerprise.com/showArticle.jhtml;?articleID=18902161.
Witten, I. H., Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann Publishers Inc.
Wong, A. K. C., Chiu, D. K. Y. 1987. Synthesizing statistical knowledge from incomplete mixed-mode data. IEEE Transactions on Pattern Analysis and Machine Intelligence 9(6), 796–805.
Wong, A. K. C., Liu, T. S. 1975. Typicality, diversity, and feature pattern of an ensemble. IEEE Transactions on Computers 24(2), 158–181.
Wu, X., Kumar, V., Ross, Q. J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., Steinberg, D. 2007. Top 10 algorithms in data mining. Knowledge and Information Systems 14(1), 1–37.
Yang, Y., Webb, G. I. 2002. A comparative study of discretization methods for naive-bayes classifiers. In Proceedings of the 2002 Pacific Rim Knowledge Acquisition Workshop, Tokyo, Japan, 18–19 August, 159–173.
Zhang, H., Jiang, L., Su, J. 2005. Hidden naive bayes. In Proceedings of the 20th National Conference on Artificial Intelligence. AAAI Press, 919–924.
Zheng, Z., Webb, G. I. 2000. Lazy learning of bayesian rules. Machine Learning 41(1), 53–84.
Zheng, F., Webb, G. I. 2006. Efficient lazy elimination for averaged one-dependence estimators. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 1113–1120.