Chapter 40

DATA MINING FOR IMBALANCED DATASETS: AN OVERVIEW

Nitesh V. Chawla
Department of Computer Science and Engineering
University of Notre Dame
IN 46530, USA

Abstract

A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years have brought increased interest in applying machine learning techniques to difficult "real-world" problems, many of which are characterized by imbalanced data. Additionally, the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating the performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this chapter, we discuss some of the sampling techniques used for balancing datasets, and the performance measures more appropriate for mining imbalanced datasets.

Keywords: imbalanced datasets, classification, sampling, ROC, cost-sensitive measures, precision and recall

1 Introduction

The issue of imbalance in the class distribution became more pronounced as machine learning algorithms were applied to real-world problems. These applications range from telecommunications management (Ezawa et al., 1996), bioinformatics (Radivojac et al., 2004), text classification (Lewis and Catlett, 1994; Dumais et al., 1998; Mladenić and Grobelnik, 1999; Cohen, 1995b), and speech recognition (Liu et al., 2004), to detection of oil spills in satellite images (Kubat et al., 1998). The imbalance can be an artifact of class distribution and/or different costs of errors or examples. It has received attention from the machine learning and data mining communities in the form of workshops (Japkowicz, 2000b; Chawla et al., 2003a; Dietterich et al., 2003; Fem et al.,