Classification on Imbalanced Data Sets, Taking
Advantage of Errors to Improve Performance
Asdrúbal López-Chau¹(✉), Farid García-Lamont², and Jair Cervantes²

¹ Centro Universitario UAEM, Universidad Autónoma del Estado de México,
CP 55600 Zumpango, Estado de México, México
alchau@uaemex.mx
² Centro Universitario UAEM, Universidad Autónoma del Estado de México,
56159 Texcoco, Estado de México, México
Abstract. Classification methods usually exhibit poor performance when
they are applied to imbalanced data sets. In order to overcome this problem,
several algorithms have been proposed in the last decade. Most of them generate
synthetic instances in order to balance the data set, regardless of the classification
algorithm. These methods work reasonably well in most cases; however, they
tend to cause over-fitting.
In this paper, we propose a method to address the imbalance problem. Our
approach, which is very simple to implement, works in two phases: the first one
detects instances that are difficult for classification methods to predict correctly.
These instances are then categorized as "noisy" or "secure", where the former
refers to instances most of whose nearest neighbors belong to the opposite class.
The second phase of our method consists of generating a number of synthetic
instances for each instance that is difficult to predict correctly. After applying
our method to the data sets, the AUC of classifiers improves dramatically. We
compare our method with state-of-the-art alternatives, using more than 10 data sets.
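The two-phase idea described in the abstract can be sketched in a few lines of Python. This is only an illustration: the neighborhood size `k`, the number of synthetic points per hard instance, and the SMOTE-style interpolation toward the nearest minority-class neighbor are assumptions for the sketch, not the paper's exact parameters or generation rule.

```python
import numpy as np

def categorize_and_oversample(X, y, minority_label, k=5, n_synth=2, seed=0):
    """Phase 1: mark each minority instance 'noisy' if most of its k nearest
    neighbors belong to the opposite class, otherwise 'secure'.
    Phase 2: create synthetic points for the hard ('noisy') instances by
    interpolating toward their nearest minority-class neighbor (assumption:
    SMOTE-style interpolation; the paper specifies its own generation rule)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mino = np.where(y == minority_label)[0]
    noisy, secure = [], []
    for i in mino:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        nn = np.argsort(d)[:k]
        if (y[nn] != minority_label).sum() > k / 2:
            noisy.append(i)                # most neighbors are opposite-class
        else:
            secure.append(i)
    synth = []
    for i in noisy:
        others = mino[mino != i]
        j = others[np.argmin(np.linalg.norm(X[others] - X[i], axis=1))]
        for _ in range(n_synth):
            lam = rng.random()             # random point on the segment X[i]..X[j]
            synth.append(X[i] + lam * (X[j] - X[i]))
    return noisy, secure, np.array(synth)
```

A minority point surrounded by majority instances is flagged "noisy" and receives synthetic reinforcements, while minority points inside their own cluster are left as "secure".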
Keywords: Imbalanced classification · Synthetic instances
1 Introduction
Achieving a good performance on imbalanced data sets is a challenging task for
classification methods [3]. They usually focus on the majority class, almost ignoring the
opposite class [8]. Currently, many real-world applications generate this
type of data set, for example: software defect detection [6], medical diagnosis [1],
fraud detection in telecommunications [4], financial risks [7] and DNA sequencing [9],
among others. In these applications there are two conflicting objectives: on the
one hand, the classifier should give priority to predicting minority class instances
with minimal error; on the other hand, the classification accuracy for majority
class instances should not be severely degraded. The AUC ROC measure is
one of the measures most widely used to capture this requirement.
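As an illustration (not part of the paper), the AUC ROC can be computed directly via the Mann-Whitney statistic: the probability that a randomly chosen minority (positive) instance receives a higher score than a randomly chosen majority (negative) one, with ties counted as one half.

```python
import numpy as np

def auc_roc(scores, labels):
    """Area under the ROC curve as the normalized Mann-Whitney U:
    P(score of a random positive > score of a random negative),
    counting ties as 1/2.  labels: 1 = positive, 0 = negative."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()   # positive outranks negative
    ties = (pos[:, None] == neg[None, :]).sum()     # tied scores count 1/2
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

Unlike plain accuracy, this measure is unaffected by the class ratio, which is why it is preferred for imbalanced problems: a classifier that always predicts the majority class gets high accuracy but an AUC of only 0.5.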
The problem of classification on imbalanced data sets has attracted the attention of
the machine learning and data mining communities in the past few years [2]. The
state-of-the-art methods to deal with this problem can be categorized into:
© Springer International Publishing Switzerland 2015
D.-S. Huang and K. Han (Eds.): ICIC 2015, Part III, LNAI 9227, pp. 72–78, 2015.
DOI: 10.1007/978-3-319-22053-6_8