D.-S. Huang et al. (Eds.): ICIC 2014, LNCS 8588, pp. 782–788, 2014.
© Springer International Publishing Switzerland 2014
A Hybrid Algorithm to Improve the Accuracy
of Support Vector Machines on Skewed Data-Sets
Jair Cervantes¹, De-Shuang Huang², Farid García-Lamont¹, and Asdrúbal López Chau¹

¹ Posgrado e Investigación UAEMEX (Autonomous University of Mexico State),
Av. Jardín Zumpango s/n, Fracc. El Tejocote, Texcoco, 56259, Mexico
² Department of Control Science & Engineering, Tongji University,
Cao'an Road 4800, Shanghai, 201804, China
Abstract. Over the past few years, it has been shown that the generalization power of
Support Vector Machines (SVM) falls dramatically on imbalanced data-sets. In
this paper, we propose a new method to improve the accuracy of SVM on imbalanced
data-sets. To achieve this, we first apply undersampling and SVM to obtain the
initial support vectors (SVs) and a sketch of the hyperplane. These support vectors
help to generate new artificial instances, which form the initial population of a
genetic algorithm. The genetic algorithm improves the population of artificial
instances from one generation to the next and eliminates instances that introduce
noise into the hyperplane. Finally, the generated and evolved data are added to
the original data-set to reduce the imbalance and improve the generalization
ability of the SVM on skewed data-sets.
Keywords: Support Vector Machines, Hybrid, Imbalanced.
1 Introduction
Many real-world applications exhibit imbalanced data-sets. In such problems the goal
of classification is to find a function that best generalizes the minority class,
which is usually the most significant one. Classical classification methods
traditionally perform poorly on imbalanced data-sets because they were not designed
to address such problems. Support Vector Machines (SVM) have shown excellent
generalization power in classification problems. However, it has been shown that
this generalization ability drops dramatically on skewed data-sets [7], [10]. The
most widely used techniques to tackle this kind of problem are under-sampling,
over-sampling and the Synthetic Minority Over-sampling Technique (SMOTE) [2].
Under-sampling takes the number of instances m in the minority class and randomly
selects m instances from the majority class. Over-sampling eliminates the imbalance
by replicating instances of the minority class or by generating artificial instances
from it. SMOTE over-samples the minority class by taking each minority class
instance and generating synthetic instances along the line segments joining it to
any or all of its k minority-class nearest neighbors. It causes no information loss
and can potentially uncover hidden minority regions.
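The two sampling schemes above can be sketched in a few lines. The following is a
minimal illustration, not the implementation used in this paper: `undersample`
performs random under-sampling (keeping m majority instances for m minority
instances), and `smote` follows the SMOTE idea of interpolating between a minority
instance and one of its k nearest minority neighbors. The toy data points, function
names, and parameter defaults are all illustrative assumptions.

```python
import random

def undersample(majority, minority, seed=0):
    # Random under-sampling: keep only as many majority instances
    # as there are minority instances (m), chosen at random.
    rng = random.Random(seed)
    return rng.sample(majority, len(minority))

def smote(minority, k=2, n_new=3, seed=0):
    # Minimal SMOTE sketch: each synthetic point lies on the line
    # segment joining a minority instance to one of its k nearest
    # minority-class neighbors, at a random position on the segment.
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: dist2(x, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1]
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic

# Toy imbalanced data: 5 majority vs. 3 minority instances.
majority = [(5.0, 5.0), (6.0, 5.0), (5.5, 6.0), (6.5, 6.5), (7.0, 5.5)]
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.8)]

balanced_majority = undersample(majority, minority)
new_points = smote(minority, k=2, n_new=3)
```

Because each synthetic point is a convex combination of two minority instances, it
always falls inside the convex hull of the minority class, which is why SMOTE adds
no information loss while densifying minority regions.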