Learning probabilistic decision trees for AUC

Harry Zhang *, Jiang Su

Faculty of Computer Science, University of New Brunswick, P.O. Box 4400, Fredericton, NB, Canada E3B 5A3

Available online 13 December 2005

Abstract

Accurate ranking, measured by AUC (the area under the ROC curve), is crucial in many real-world applications. Most traditional learning algorithms, however, aim only at high classification accuracy. It has been observed that traditional decision trees produce good classification accuracy but poor probability estimates. Since the ranking generated by a decision tree is based on the class probabilities, a probability estimation tree (PET) with accurate probability estimates is desired in order to yield a high AUC. Some researchers ascribe the poor probability estimates of decision trees to the decision tree learning algorithms. In our observation, however, the representation also plays an important role. In this paper, we propose to extend decision trees to represent a joint distribution and conditional independence, called conditional independence trees (CITrees), a model more suitable for yielding a high AUC. We propose a novel AUC-based algorithm for learning CITrees, and our experiments show that the CITree algorithm outperforms the state-of-the-art decision tree learning algorithm C4.4 (a variant of C4.5), naive Bayes, and NBTree in AUC. Our work provides an effective model and algorithm for applications in which an accurate ranking is required.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Decision trees; AUC; Naive Bayes; Ranking

1. Introduction

Classification is one of the most important tasks in machine learning and pattern recognition. In classification, a classifier is built from a set of training examples with class labels. A key performance measure of a classifier is its predictive accuracy (or error rate, 1 − accuracy).
Many classifiers can also produce the class probability estimate p(c|E), the probability that an example E belongs to class c. However, this information is largely ignored: the error rate does not consider how "far off" (be it 0.45 or 0.01) the prediction for each example is from its target, but only which class receives the largest probability estimate.

In many applications, however, classification and error rate are not enough. For example, in direct marketing, we often need to promote the top X% of customers during a gradual roll-out, or we often deploy different promotion strategies to customers with different likelihoods of buying some products. To accomplish these tasks, we need more than a mere classification of buyers and non-buyers; we need (at least) a ranking of customers in terms of their likelihoods of buying. Thus, a ranking is much more desirable than just a classification.

If we are aiming at an accurate ranking from a classifier, one might naturally think that we need the true ranking of the training examples. In most scenarios, however, that is not possible. Most likely, what we are given is a data set of examples with class labels. Fortunately, when only a training set with class labels is given, the area under the ROC (receiver operating characteristic) curve (Swets, 1988; Provost and Fawcett, 1997), or simply AUC, can be used to evaluate classifiers that also produce rankings. Hand and Till (2001) show that, for binary classification, AUC is equivalent to the probability that a randomly chosen example of class − will have a smaller estimated probability of belonging to class + than a randomly chosen example of class +. They present a simple approach to calculating the AUC of a classifier G below:

0167-8655/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2005.10.013

* Corresponding author. Fax: +1 506 453 3566. E-mail address: hzhang@unb.ca (H. Zhang).
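As an aside, the probabilistic interpretation of AUC due to Hand and Till (2001) lends itself to a direct computation. The sketch below is an illustration, not code from the paper (all function and variable names are ours): it computes the empirical AUC both by counting pairwise wins of positives over negatives (ties counted as 1/2) and by the equivalent rank-sum formulation.

```python
from collections import defaultdict

def auc_by_pairwise_ranking(scores_pos, scores_neg):
    """AUC = P(score of a random negative < score of a random positive),
    with ties counted as 1/2.  O(n0 * n1) but makes the definition explicit."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sn < sp:
                wins += 1.0
            elif sn == sp:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def auc_by_ranks(scores_pos, scores_neg):
    """Equivalent rank-sum computation: pool all scores, rank them
    ascending (average rank for ties), and sum the ranks of the
    positive examples."""
    pooled = sorted(scores_pos + scores_neg)
    # average 1-based rank of each distinct score value, to handle ties
    positions = defaultdict(list)
    for i, s in enumerate(pooled, start=1):
        positions[s].append(i)
    rank = {s: sum(idx) / len(idx) for s, idx in positions.items()}
    n1, n0 = len(scores_pos), len(scores_neg)
    s1 = sum(rank[s] for s in scores_pos)  # rank sum of positives
    return (s1 - n1 * (n1 + 1) / 2) / (n1 * n0)
```

Both routines agree; for example, perfectly separated scores give AUC = 1.0, and identical scores for a single positive and a single negative give 0.5.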
www.elsevier.com/locate/patrec Pattern Recognition Letters 27 (2006) 892–899