Combining symbolic classifiers from multiple inducers

José Augusto Baranauskas *, Maria Carolina Monard

Laboratory of Computational Intelligence, Department of Computer Science and Statistics, Institute of Mathematics and Computer Sciences, University of São Paulo, Av. Trabalhador Sãocarlense, 400 P.O. Box 668, São Carlos, São Paulo 13566-590, Brazil

Received 8 August 2001; accepted 4 February 2002

Abstract

Classification algorithms for large databases have many practical applications in data mining. Whenever a dataset is too large for a particular learning algorithm to be applied, sampling can be used to scale up classifiers to massive datasets. One general approach associated with sampling is the construction of ensembles. Although benefits in accuracy can be obtained from the use of ensembles, one problem is their interpretability. This has motivated our work on trying to use the benefits of combining symbolic classifiers, while still keeping the symbolic component in the learning system. This idea has been implemented in the XRULER system. We describe the XRULER system, as well as experiments performed to evaluate it on 10 datasets. The results show that it is possible to combine symbolic classifiers into a final symbolic classifier with an increase in accuracy and a decrease in the number of final rules. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Combining classifiers; Machine Learning; Data Mining

1. Introduction

The usual approaches to developing a knowledge base involve manually formalizing an expert's knowledge and encoding it in appropriate data structures. One important application of Machine Learning (ML) concerns the (semi-)automatic construction of knowledge bases through inductive inference. Indeed, ML can provide an improvement of current techniques and a basis for developing alternative knowledge acquisition approaches. One challenge in predictive Data Mining (DM) is to exploit and combine existing ML algorithms effectively.
However, not all ML algorithms are designed to deal with big data, an important issue in DM. One solution to this problem is to take several samples from the database and induce knowledge by using each sample as input to any ML algorithm. After that, it is possible to use the ensemble approach to classify new instances [5,9]. Although benefits in accuracy can be obtained from the use of ensembles, one problem is their interpretability: combining symbolic classifiers into one single classifier by a majority (or other) vote mechanism does not result in a symbolic classifier, i.e. an ensemble of symbolic classifiers is no longer symbolic. In this context, a symbolic classifier is one that can be transformed into a set of rules.

In our work, we are interested in interpretability, i.e. our concern is not only the correct classification of new instances, but also explaining to the user the reasons for classifying a new instance into one of the possible classes. In other words, we are interested in symbolic classifiers, specifically in classifiers that can be transformed into a set of rules.

In this work, we propose and describe the XRULER system, which can be used to combine symbolic knowledge extracted from different samples using different inducers. The main idea is to transform each symbolic classifier into a standard if-then rule format before trying to merge the classifiers into a final one. These ideas are currently being implemented in the DISCOVER project, a computational environment under development, designed as a generic knowledge discovery tool for DM research.

This work is organized as follows. Section 2 introduces the notation and definitions used in the remaining text. Section 3 provides a description of the XRULER system, including the basic cover algorithm, criteria for choosing the best rule, as well as classifying new instances.
Section 4 outlines the experimental setup used to evaluate the XRULER system and Section 5 reports the experimental results and discussion. Finally, Section 6 presents some concluding remarks.

0950-7051/03/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0950-7051(02)00021-7
Knowledge-Based Systems 16 (2003) 129–136
www.elsevier.com/locate/knosys

* Corresponding author. E-mail addresses: augusto@fmrp.usp.br (J.A. Baranauskas); mcmonard@icmc.usp.br (M.C. Monard).
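
The interpretability problem raised in the Introduction can be illustrated with a minimal sketch. It is not the XRULER algorithm; it only contrasts a single rule-based classifier, whose answer can be traced to one if-then rule, with a majority vote over several such classifiers. The attribute names (`age`, `income`), thresholds, class labels, and helper names (`Rule`, `classify`, `majority_vote`) are all hypothetical and chosen only for illustration.

```python
import operator
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Comparison operators allowed in rule conditions (an illustrative choice).
OPS = {"<=": operator.le, ">": operator.gt, "==": operator.eq}

Instance = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    """An if-then rule: if every condition holds, predict `label`."""
    conditions: Tuple[Tuple[str, str, float], ...]
    label: str

    def covers(self, x: Instance) -> bool:
        return all(OPS[op](x[attr], value) for attr, op, value in self.conditions)

    def __str__(self) -> str:
        body = " AND ".join(f"{a} {op} {v}" for a, op, v in self.conditions)
        return f"IF {body} THEN class = {self.label}"


def classify(rules: List[Rule], default: str, x: Instance) -> Tuple[str, Optional[Rule]]:
    """Return the label and the first rule that covers x (None if the default is used)."""
    for rule in rules:
        if rule.covers(x):
            return rule.label, rule
    return default, None


def majority_vote(classifiers: List[Tuple[List[Rule], str]], x: Instance) -> str:
    """Combine rule-based classifiers by simple voting; only a label is returned."""
    votes = Counter(classify(rules, default, x)[0] for rules, default in classifiers)
    return votes.most_common(1)[0][0]


# Three hypothetical rule sets, e.g. induced from three samples of one database.
c1 = ([Rule((("age", "<=", 30),), "no"), Rule((("income", ">", 50),), "yes")], "no")
c2 = ([Rule((("income", ">", 40), ("age", ">", 25)), "yes")], "no")
c3 = ([Rule((("age", "<=", 35),), "no")], "yes")

x = {"age": 28, "income": 60}

label, fired = classify(*c1, x)
print("single classifier:", label, "because", fired)
print("ensemble vote:", majority_vote([c1, c2, c3], x))
```

For this instance the single classifier `c1` answers "no" and can point to the rule that fired, whereas the vote over `c1`, `c2`, and `c3` also answers "no" but is backed by two different rules and opposed by a third, so no single rule explains the ensemble's decision. This is the sense in which an ensemble of symbolic classifiers is no longer symbolic.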