Classification on Imbalanced Data Sets, Taking Advantage of Errors to Improve Performance

Asdrúbal López-Chau 1(&), Farid García-Lamont 2, and Jair Cervantes 2

1 Centro Universitario UAEM, Universidad Autónoma del Estado de México, CP 55600 Zumpango, Estado de México, México. alchau@uaemex.mx
2 Centro Universitario UAEM, Universidad Autónoma del Estado de México, 56159 Texcoco, Estado de México, México

Abstract. Classification methods usually exhibit poor performance when they are applied to imbalanced data sets. In order to overcome this problem, several algorithms have been proposed in the last decade. Most of them generate synthetic instances in order to balance data sets, regardless of the classification algorithm. These methods work reasonably well in most cases; however, they tend to cause over-fitting. In this paper, we propose a method to address the imbalance problem. Our approach, which is very simple to implement, works in two phases: the first detects instances that are difficult for classification methods to predict correctly. These instances are then categorized into noisy and secure, where the former refers to instances most of whose nearest neighbors belong to the opposite class. The second phase consists in generating a number of synthetic instances for each of the instances that are difficult to predict correctly. After applying our method to data sets, the AUC of classifiers improves dramatically. We compare our method with others from the state of the art, using more than 10 data sets.

Keywords: Imbalanced · Classification · Synthetic instances

1 Introduction

Achieving good performance on imbalanced data sets is a challenging task for classification methods [3]. They usually focus on the majority class, almost ignoring the opposite class [8].
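The two-phase idea from the abstract can be sketched in a few lines. This is an illustrative sketch only, not the paper's exact algorithm: the function names, the majority-vote criterion for labeling an instance "noisy", and the SMOTE-style linear interpolation used to create synthetic points are assumptions made here for clarity.

```python
import numpy as np

def find_difficult_instances(X, y, minority_label, k=5):
    """Phase 1 (sketch): split minority instances into 'noisy'
    (most of their k nearest neighbors belong to the opposite
    class) and 'secure' (the rest)."""
    noisy, secure = [], []
    for i in np.where(y == minority_label)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the instance itself
        nn = np.argsort(d)[:k]             # indices of k nearest neighbors
        if np.sum(y[nn] != minority_label) > k / 2:
            noisy.append(i)
        else:
            secure.append(i)
    return noisy, secure

def synthesize(X, indices, n_new=2, rng=np.random.default_rng(0)):
    """Phase 2 (sketch): generate n_new synthetic points per selected
    instance by interpolating toward another selected instance."""
    new = []
    for i in indices:
        for _ in range(n_new):
            j = rng.choice(indices)
            gap = rng.random()             # position along the segment
            new.append(X[i] + gap * (X[j] - X[i]))
    return np.array(new)
```

For example, a minority point surrounded by majority points would be flagged as noisy, while a compact minority cluster would be secure and would receive interpolated synthetic neighbors.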
Currently, many real-world applications generate this type of data set, for example: software defect detection [6], medical diagnosis [1], fraud detection in telecommunications [4], financial risks [7] and DNA sequencing [9], among others. In these applications there are two objectives in conflict: on the one hand, the classifier should predict the minority-class instances with minimal error; on the other hand, the classification accuracy on majority-class instances should not be severely degraded. The AUC (area under the ROC curve) is one of the measures most widely used to capture this requirement.

The problem of classification on imbalanced data sets has attracted the attention of the machine learning and data mining communities in the past few years [2]. The state-of-the-art methods to deal with this problem can be categorized into:

© Springer International Publishing Switzerland 2015. D.-S. Huang and K. Han (Eds.): ICIC 2015, Part III, LNAI 9227, pp. 72–78, 2015. DOI: 10.1007/978-3-319-22053-6_8