Learning probabilistic decision trees for AUC

Harry Zhang *, Jiang Su

Faculty of Computer Science, University of New Brunswick, P.O. Box 4400, Fredericton, NB, Canada E3B 5A3

Available online 13 December 2005

Abstract

Accurate ranking, measured by AUC (the area under the ROC curve), is crucial in many real-world applications. Most traditional learning algorithms, however, aim only at high classification accuracy. It has been observed that traditional decision trees produce good classification accuracy but poor probability estimates. Since the ranking generated by a decision tree is based on the class probabilities, a probability estimation tree (PET) with accurate probability estimates is desired in order to yield a high AUC. Some researchers ascribe the poor probability estimates of decision trees to the decision tree learning algorithms. In our observation, however, the representation also plays an important role. In this paper, we propose to extend decision trees to represent a joint distribution and conditional independence, called conditional independence trees (CITrees), a model more suitable for yielding a high AUC. We propose a novel AUC-based algorithm for learning CITrees, and our experiments show that the CITree algorithm outperforms the state-of-the-art decision tree learning algorithm C4.4 (a variant of C4.5), naive Bayes, and NBTree in AUC. Our work provides an effective model and algorithm for applications in which an accurate ranking is required.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Decision trees; AUC; Naive Bayes; Ranking

1. Introduction

Classification is one of the most important tasks in machine learning and pattern recognition. In classification, a classifier is built from a set of training examples with class labels. A key performance measure of a classifier is its predictive accuracy (or error rate, 1 − accuracy).
Many classifiers can also produce the class probability estimate p(c|E), the probability that an example E belongs to class c. However, this information is largely ignored: the error rate does not consider how "far off" (be it 0.45 or 0.01) the prediction for each example is from its target, but only which class receives the largest probability estimate.

In many applications, however, classification and error rate are not enough. For example, in direct marketing, we often need to promote the top X% of customers during a gradual roll-out, or we often deploy different promotion strategies to customers with different likelihoods of buying some products. To accomplish these tasks, we need more than a mere classification of buyers and non-buyers; we need (at least) a ranking of customers in terms of their likelihoods of buying. Thus, a ranking is much more desirable than just a classification.

If we are aiming at an accurate ranking from a classifier, one might naturally think that we need the true ranking of the training examples. In most scenarios, however, that is not possible. Most likely, what we are given is a data set of examples with class labels. Fortunately, when only a training set with class labels is given, the area under the ROC (receiver operating characteristic) curve (Swets, 1988; Provost and Fawcett, 1997), or simply AUC, can be used to evaluate classifiers that also produce rankings. Hand and Till (2001) show that, for binary classification, AUC is equivalent to the probability that a randomly chosen example of class − will have a smaller estimated probability of belonging to class + than a randomly chosen example of class +. They present a simple approach to calculating the AUC of a classifier G below:

0167-8655/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2005.10.013

* Corresponding author. Fax: +1 506 453 3566. E-mail address: hzhang@unb.ca (H. Zhang).
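As an aside, the probabilistic interpretation of AUC due to Hand and Till (2001) lends itself to a direct computation. The sketch below is an illustration, not code from the paper (all function and variable names are ours): it computes the empirical AUC both by counting pairwise wins of positives over negatives (ties counted as 1/2) and by the equivalent rank-sum formulation.

```python
from collections import defaultdict

def auc_by_pairwise_ranking(scores_pos, scores_neg):
    """AUC = P(score of a random negative < score of a random positive),
    with ties counted as 1/2.  O(n0 * n1) but makes the definition explicit."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sn < sp:
                wins += 1.0
            elif sn == sp:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def auc_by_ranks(scores_pos, scores_neg):
    """Equivalent rank-sum computation: pool all scores, rank them
    ascending (average rank for ties), and sum the ranks of the
    positive examples."""
    pooled = sorted(scores_pos + scores_neg)
    # average 1-based rank of each distinct score value, to handle ties
    positions = defaultdict(list)
    for i, s in enumerate(pooled, start=1):
        positions[s].append(i)
    rank = {s: sum(idx) / len(idx) for s, idx in positions.items()}
    n1, n0 = len(scores_pos), len(scores_neg)
    s1 = sum(rank[s] for s in scores_pos)  # rank sum of positives
    return (s1 - n1 * (n1 + 1) / 2) / (n1 * n0)
```

Both routines agree; for example, perfectly separated scores give AUC = 1.0, and identical scores for a single positive and a single negative give 0.5.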
www.elsevier.com/locate/patrec Pattern Recognition Letters 27 (2006) 892–899