Scaling up classification rule induction through parallel processing

Frederic Stahl; Max Bramer

doi:10.1017/S0269888912000355

Published online by Cambridge University Press: 26 November 2012

Frederic Stahl: School of Systems Engineering, University of Reading, Whiteknights, Reading RG6 6AY, UK
Max Bramer: School of Computing, University of Portsmouth, Buckingham Building, Lion Terrace, PO1 3HE Portsmouth, UK

Abstract

The rapid growth in the size and number of databases demands data mining approaches that scale to large amounts of data. This has led to the exploration of parallel computing technologies that perform data mining tasks concurrently on several processors. Parallelization appears to be a natural and cost-effective way to scale up data mining technologies. One of the most important of these technologies is the classification of newly recorded data. This paper surveys advances in parallelization in the field of classification rule induction.

Type: Articles
Information: The Knowledge Engineering Review, Volume 28, Issue 4, December 2013, pp. 451–478

DOI: https://doi.org/10.1017/S0269888912000355
Copyright: Copyright © Cambridge University Press 2012


