GP Classification under Imbalanced Data sets: Active Sub-sampling and AUC Approximation*

John Doucette and Malcolm I. Heywood

January 5, 2009

Abstract

The problem of evolving binary classification models under increasingly unbalanced data sets is approached by proposing a strategy consisting of two components: sub-sampling and 'robust' fitness function design. In particular, recent work in the wider machine learning literature has recognized that maintaining the original distribution of exemplars during training is often not appropriate for designing classifiers that are robust to degenerate classifier behavior. To this end we propose a 'Simple Active Learning Heuristic' (SALH) in which a subset of exemplars is sampled with uniform probability under a class-balance-enforcing rule for fitness evaluation. In addition, an efficient estimator for the Area Under the Curve (AUC) performance metric is assumed in the form of a modified Wilcoxon-Mann-Whitney (WMW) statistic. Performance is evaluated on six representative UCI data sets and benchmarked against: canonical GP, SALH-based GP, SALH and the modified WMW statistic, and deterministic classifiers (Naive Bayes and C4.5). The resulting SALH-WMW model is demonstrated to be both efficient and effective at providing solutions maximizing performance assessed in terms of AUC.

1 Introduction

Genetic Programming (GP) provides many unique opportunities for posing solutions to the basic Machine Learning design questions of representation, cost function, and credit assignment. In this work we are specifically interested in the topic of cost function design under the classification domain of supervised learning. Classically, an equally weighted cost function is assumed, such as 'hits' [11] or sum squared error [2].
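As a rough illustration of why an equally weighted cost function can mislead under class imbalance, the Python sketch below scores a degenerate classifier with a 'hits' count and with the standard (unmodified) Wilcoxon-Mann-Whitney estimate of AUC, and draws a class-balanced subset by uniform sampling. It is a minimal sketch under stated assumptions, not the SALH or modified WMW procedure proposed here: the function names, the zero decision threshold, the 5/95 class split, and the fallback to sampling with replacement are all hypothetical choices made for the example.

```python
import random


def hits(outputs, labels, threshold=0.0):
    """Equally weighted 'hits': number of exemplars classified correctly.

    Assumption: an output above `threshold` is read as class 1.
    """
    return sum((o > threshold) == bool(y) for o, y in zip(outputs, labels))


def wmw_auc(outputs, labels):
    """Standard WMW estimate of AUC: the fraction of (positive, negative)
    pairs ranked correctly, with ties counted as one half."""
    pos = [o for o, y in zip(outputs, labels) if y == 1]
    neg = [o for o, y in zip(outputs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one exemplar of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def balanced_subsample(labels, per_class, rng):
    """Indices of a subset holding `per_class` exemplars of each class,
    drawn with uniform probability within each class."""
    chosen = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        if len(idx) >= per_class:
            chosen.extend(rng.sample(idx, per_class))
        else:
            # Assumption: fall back to sampling with replacement
            # when a class has too few exemplars.
            chosen.extend(rng.choices(idx, k=per_class))
    return chosen


# Toy imbalanced problem: 5 in-class and 95 out-of-class exemplars.
labels = [1] * 5 + [0] * 95
# Degenerate classifier: the same output for every exemplar (always class 0).
degenerate = [-1.0] * len(labels)

print(hits(degenerate, labels))     # 95
print(wmw_auc(degenerate, labels))  # 0.5

subset = balanced_subsample(labels, per_class=5, rng=random.Random(0))
print(sum(labels[i] for i in subset), len(subset))  # 5 10
```

On this toy data the degenerate classifier collects 95 of 100 hits while its WMW estimate stays at 0.5, the value expected of a classifier with no ranking ability.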
* Published in the Proceedings of the European Conference on Genetic Programming (EuroGP), 2008. Lecture Notes in Computer Science, Vol. 4971. Copyright Springer-Verlag. J. Doucette and M.I. Heywood are with the Faculty of Computer Science, Dalhousie University, 6050 University Av., Halifax, NS, B3H 1W5, Canada.

Such a design choice might be natural under balanced binary classification problems where each class carries an equal risk, but is questionable in the wider context of real-world data sets that are frequently