Combining symbolic classifiers from multiple inducers

José Augusto Baranauskas*, Maria Carolina Monard

Laboratory of Computational Intelligence, Department of Computer Science and Statistics, Institute of Mathematics and Computer Sciences, University of São Paulo, Av. Trabalhador São-carlense, 400, P.O. Box 668, São Carlos, São Paulo 13566-590, Brazil

Received 8 August 2001; accepted 4 February 2002

Abstract

Classification algorithms for large databases have many practical applications in data mining. Whenever a dataset is too large for a particular learning algorithm to be applied, sampling can be used to scale up classifiers to massive datasets. One general approach associated with sampling is the construction of ensembles. Although benefits in accuracy can be obtained from the use of ensembles, one problem is their interpretability. This has motivated our work on trying to use the benefits of combining symbolic classifiers, while still keeping the symbolic component in the learning system. This idea has been implemented in the XRULER system. We describe the XRULER system, as well as experiments performed to evaluate it on 10 datasets. The results show that it is possible to combine symbolic classifiers into a final symbolic classifier with an increase in accuracy and a decrease in the number of final rules. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Combining classifiers; Machine Learning; Data Mining

1. Introduction

Usual approaches for developing a knowledge base involve manually formalizing the expert's knowledge and encoding it in appropriate data structures. One important application of Machine Learning (ML) concerns the (semi-)automatic construction of knowledge bases through inductive inference. Indeed, ML can improve current techniques and provide a basis for developing alternative knowledge acquisition approaches. One challenge in predictive Data Mining (DM) is to exploit and combine existing ML algorithms effectively.
However, not all ML algorithms are designed to deal with big data, an important issue in DM. One solution to this problem is to take several samples from the database and induce knowledge from each of them, using the samples as input to any ML algorithm. After that, it is possible to use the ensemble approach to classify new instances [5,9].

Although benefits in accuracy can be obtained from the use of ensembles, one problem is their interpretability: combining symbolic classifiers into one single classifier by a majority (or other) vote mechanism does not result in a symbolic classifier, i.e. an ensemble of symbolic classifiers is no longer symbolic. In this context, a symbolic classifier is one that can be transformed into a set of rules.

In our work, we are interested in interpretability, i.e. our concern is not only the correct classification of new instances, but also explaining to the user the reasons for classifying a new instance into one of the possible classes. In other words, we are interested in symbolic classifiers, specifically in classifiers that can be transformed into a set of rules.

In this work, we propose and describe the XRULER system, which can be used to combine symbolic knowledge extracted from different samples using different inducers. The main idea is to transform each symbolic classifier into a standard if-then rule format before trying to merge the classifiers into a final one. These ideas are currently being implemented in the DISCOVER project, a computational environment under development, designed as a generic knowledge discovery tool for DM research.

This work is organized as follows. Section 2 introduces the notation and definitions used in the remaining text. Section 3 provides a description of the XRULER system, including the basic cover algorithm, criteria for choosing the best rule, as well as classifying new instances.
Section 4 outlines the experimental setup used to evaluate the XRULER system and Section 5 reports the experimental results and discussion. Finally, Section 6 presents some concluding remarks.

Knowledge-Based Systems 16 (2003) 129–136. PII: S0950-7051(02)00021-7

* Corresponding author. E-mail addresses: augusto@fmrp.usp.br (J.A. Baranauskas); mcmonard@icmc.usp.br (M.C. Monard).
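
The interpretability problem raised in the introduction can be made concrete with a small example. The following minimal Python sketch is not part of XRULER: the rule representation, the attribute names, the labels and the three rule sets are all hypothetical, chosen only to show how a plurality vote over rule-based classifiers behaves. Each classifier is assumed to be an ordered list of if-then rules with a default label.

```python
from collections import Counter

# Hypothetical representation (not XRULER's): a rule is (conditions, label),
# where conditions maps an attribute name to the value it must take.
# A symbolic classifier is an ordered rule list plus a default label.

def matches(conditions, instance):
    """True if the instance satisfies every attribute test of the rule."""
    return all(instance.get(attr) == value for attr, value in conditions.items())

def classify(classifier, instance):
    """Return (label, rule) for the first rule covering the instance."""
    rules, default = classifier
    for rule in rules:
        conditions, label = rule
        if matches(conditions, instance):
            return label, rule
    return default, None  # no rule fired: fall back to the default label

def majority_vote(classifiers, instance):
    """Plurality vote over the labels predicted by each rule set."""
    votes = Counter(classify(c, instance)[0] for c in classifiers)
    return votes.most_common(1)[0][0]

# Three invented rule sets, standing in for classifiers induced from
# three different samples of the same database.
c1 = ([({"outlook": "sunny"}, "no"), ({"outlook": "rain"}, "yes")], "yes")
c2 = ([({"humidity": "high"}, "no")], "yes")
c3 = ([({"windy": True}, "no")], "yes")

instance = {"outlook": "sunny", "humidity": "normal", "windy": False}

# A single classifier returns a label together with the rule that produced it.
label_c1, rule_c1 = classify(c1, instance)

# The ensemble returns only a label; the justification is a vote count.
ensemble_label = majority_vote([c1, c2, c3], instance)
```

In this sketch the first classifier can justify its answer with the rule that fired, whereas the voted answer is supported only by a tally of labels and may even disagree with that rule. This is the sense in which an ensemble of symbolic classifiers is no longer symbolic, and it is the reason for merging rules into a single rule set instead of voting.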