An empirical comparison of cost-sensitive decision tree induction algorithms

Susan Lomax and Sunil Vadera
Data Mining & Pattern Recognition Research Centre, School of Computing, Science and Engineering, University of Salford, Salford M5 4WT, UK
Email: s.e.lomax@edu.salford.ac.uk

Abstract: Decision tree induction is a widely used technique for learning from data, which first emerged in the 1980s. In recent years, several authors have noted that accuracy alone is not adequate in practice, and that it has become increasingly important to take into account the cost of misclassifying the data. Several authors have therefore developed techniques to induce cost-sensitive decision trees. Although many studies include pair-wise comparisons of algorithms, no earlier work has compared a wide range of methods. This paper aims to remedy this situation by investigating different cost-sensitive decision tree induction algorithms. A survey has identified 30 cost-sensitive decision tree algorithms, which can be organized into 10 categories. A representative sample of these algorithms has been implemented and an empirical evaluation has been carried out. In addition, an accuracy-based look-ahead algorithm has been extended to a new cost-sensitive look-ahead algorithm and also evaluated. The main outcome of the evaluation is that an algorithm based on genetic algorithms, known as Inexpensive Classification with Expensive Tests, performed best across the range of experiments, showing that to make a decision tree cost-sensitive, it is better to include all the different types of costs, that is, both the cost of obtaining the data and the misclassification costs, in the induction of the decision tree.

Keywords: cost-sensitive learning, decision trees, data mining

1. Introduction

Decision trees are a natural way of presenting a decision-making process, as they are simple and easy for anyone to understand (Quinlan, 1979).
Learning them from data, however, is more complex. The most common method of doing so was originally developed by Quinlan (1986) and is known as ID3. It takes a table of examples as input, where each example consists of a collection of attributes relating to that example, one attribute being the outcome (class) for that particular example. A divide and conquer technique is used, splitting the data into subsets. Each node is a test on an attribute, each branch is an outcome of that test, and at the end are leaf nodes indicating the class to which an example following that path belongs. The data set is split into a training set, which is used to build the decision tree, and the remaining examples, known as a validation (or testing) set. The examples in the validation set are then passed through the decision tree and examined for accuracy. However, in practice there are costs involved. It may cost money to obtain attribute values for the examples in the data set, for example, when blood tests have to be carried out (Quinlan et al., 1987). In addition, when examples are misclassified, they incur misclassification costs, which may, especially when dealing with binary data sets, have a variation between costs of false negatives

DOI: 10.1111/j.1468-0394.2010.00573.x
© 2011 Blackwell Publishing Ltd, Expert Systems, July 2011, Vol. 28, No. 3
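The divide and conquer procedure described above can be sketched in a few lines of code. The following is a minimal illustration in Python, not any of the implementations evaluated in this paper; the dict-based tree representation, the toy data set, and the `misclassification_cost` helper are assumptions introduced purely for illustration of how ID3-style splitting and a cost matrix interact.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in entropy from splitting the examples on one attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    """Divide and conquer: pick the best attribute, split, and recurse.

    A leaf is a class label; an internal node is (attribute, {value: subtree}).
    """
    if len(set(labels)) == 1:          # pure subset -> leaf node
        return labels[0]
    if not attrs:                       # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    branches = {}
    for value in set(row[best] for row in rows):
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        branches[value] = id3([r for r, _ in sub], [l for _, l in sub],
                              [a for a in attrs if a != best])
    return (best, branches)

def classify(tree, row, default):
    """Follow the branches matching the example's attribute values."""
    while isinstance(tree, tuple):
        attr, branches = tree
        if row[attr] not in branches:
            return default
        tree = branches[row[attr]]
    return tree

def misclassification_cost(tree, examples, cost, default):
    """Total cost of errors; cost maps (actual, predicted) pairs to penalties."""
    return sum(cost.get((label, classify(tree, row, default)), 0)
               for row, label in examples)
```

Note that `misclassification_cost` is applied only after induction; the cost-sensitive algorithms surveyed in this paper differ precisely in whether such costs (and the costs of obtaining attribute values) also influence the choice of splitting attribute during induction itself.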