Classification on Imbalanced Data Sets, Taking
Advantage of Errors to Improve Performance
Asdrúbal López-Chau¹(✉), Farid García-Lamont², and Jair Cervantes²

¹ Centro Universitario UAEM, Universidad Autónoma del Estado de México,
CP 55600 Zumpango, Estado de México, México
alchau@uaemex.mx
² Centro Universitario UAEM, Universidad Autónoma del Estado de México,
56159 Texcoco, Estado de México, México
Abstract. Classification methods usually exhibit poor performance when
they are applied to imbalanced data sets. In order to overcome this problem,
several algorithms have been proposed in the last decade. Most of them generate
synthetic instances in order to balance the data set, regardless of the classification
algorithm. These methods work reasonably well in most cases; however, they
tend to cause over-fitting.
In this paper, we propose a method to address the imbalance problem. Our
approach, which is very simple to implement, works in two phases: the first one
detects instances that are difficult for classification methods to predict correctly.
These instances are then categorized as "noisy" or "secure", where the former
refers to instances most of whose nearest neighbors belong to the opposite class.
The second phase of our method consists of generating a number of synthetic
instances for each instance that is difficult to predict correctly. After applying
our method to the data sets, the AUC of classifiers improves dramatically. We
compare our method with state-of-the-art alternatives, using more than 10 data sets.
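The two-phase idea described in the abstract can be sketched in a few lines of Python. This is only an illustration: the neighborhood size `k`, the number of synthetic points per hard instance, and the SMOTE-style interpolation toward the nearest minority-class neighbor are assumptions for the sketch, not the paper's exact parameters or generation rule.

```python
import numpy as np

def categorize_and_oversample(X, y, minority_label, k=5, n_synth=2, seed=0):
    """Phase 1: mark each minority instance 'noisy' if most of its k nearest
    neighbors belong to the opposite class, otherwise 'secure'.
    Phase 2: create synthetic points for the hard ('noisy') instances by
    interpolating toward their nearest minority-class neighbor (assumption:
    SMOTE-style interpolation; the paper specifies its own generation rule)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mino = np.where(y == minority_label)[0]
    noisy, secure = [], []
    for i in mino:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        nn = np.argsort(d)[:k]
        if (y[nn] != minority_label).sum() > k / 2:
            noisy.append(i)                # most neighbors are opposite-class
        else:
            secure.append(i)
    synth = []
    for i in noisy:
        others = mino[mino != i]
        j = others[np.argmin(np.linalg.norm(X[others] - X[i], axis=1))]
        for _ in range(n_synth):
            lam = rng.random()             # random point on the segment X[i]..X[j]
            synth.append(X[i] + lam * (X[j] - X[i]))
    return noisy, secure, np.array(synth)
```

A minority point surrounded by majority instances is flagged "noisy" and receives synthetic reinforcements, while minority points inside their own cluster are left as "secure".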
Keywords: Imbalanced classification · Synthetic instances
1 Introduction
Achieving a good performance on imbalanced data sets is a challenging task for
classification methods [3]. They usually focus on the majority class, almost ignoring the
opposite class [8]. Currently, many real-world applications generate this
type of data set, for example: software defect detection [6], medical diagnosis [1],
fraud detection in telecommunications [4], financial risks [7] and DNA sequencing [9],
among others. In these applications there are two conflicting objectives: on the
one hand, the classifier should give priority to predicting minority class instances
with minimal error; on the other hand, the classification accuracy for majority
class instances should not be severely degraded. The AUC ROC measure is
one of the measures most widely used to capture this requirement.
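As an illustration (not part of the paper), the AUC ROC can be computed directly via the Mann-Whitney statistic: the probability that a randomly chosen minority (positive) instance receives a higher score than a randomly chosen majority (negative) one, with ties counted as one half.

```python
import numpy as np

def auc_roc(scores, labels):
    """Area under the ROC curve as the normalized Mann-Whitney U:
    P(score of a random positive > score of a random negative),
    counting ties as 1/2.  labels: 1 = positive, 0 = negative."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()   # positive outranks negative
    ties = (pos[:, None] == neg[None, :]).sum()     # tied scores count 1/2
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

Unlike plain accuracy, this measure is unaffected by the class ratio, which is why it is preferred for imbalanced problems: a classifier that always predicts the majority class gets high accuracy but an AUC of only 0.5.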
The problem of classification on imbalanced data sets has attracted the attention of
the machine learning and data mining communities in the past few years [2]. The
state-of-the-art methods to deal with this problem can be categorized into:
© Springer International Publishing Switzerland 2015
D.-S. Huang and K. Han (Eds.): ICIC 2015, Part III, LNAI 9227, pp. 72–78, 2015.
DOI: 10.1007/978-3-319-22053-6_8