D.-S. Huang et al. (Eds.): ICIC 2014, LNCS 8588, pp. 782–788, 2014.
© Springer International Publishing Switzerland 2014

A Hybrid Algorithm to Improve the Accuracy of Support Vector Machines on Skewed Data-Sets

Jair Cervantes¹, De-Shuang Huang², Farid García-Lamont¹, and Asdrúbal López Chau¹

¹ Posgrado e Investigación UAEMEX (Autonomous University of Mexico State), Av. Jardín Zumpango s/n, Fracc. El Tejocote, Texcoco, 56259, Mexico
² Department of Control Science & Engineering, Tongji University, Cao'an Road 4800, Shanghai, 201804, China

Abstract. Over the past few years, it has been shown that the generalization power of Support Vector Machines (SVM) falls dramatically on imbalanced data-sets. In this paper, we propose a new method to improve the accuracy of SVM on imbalanced data-sets. First, we apply under-sampling and train an SVM to obtain the initial support vectors and a sketch of the hyperplane. These support vectors help to generate new artificial instances, which form the initial population of a genetic algorithm. The genetic algorithm improves the population of artificial instances from one generation to the next and eliminates instances that produce noise in the hyperplane. Finally, the generated and evolved data are added to the original data-set to reduce the imbalance and improve the generalization ability of the SVM on skewed data-sets.

Keywords: Support Vector Machines, Hybrid, Imbalanced.

1 Introduction

Many real-world applications exhibit imbalance in their data-sets. In these classification problems, the goal is to find a function that best generalizes the minority class, which is usually the most significant one. Traditionally, classical classification methods do not perform well on imbalanced data-sets, because they were not designed to address such problems. Support Vector Machines (SVM) have shown excellent generalization power in classification problems.
However, it has been shown that this generalization ability of SVM drops dramatically on skewed data-sets [7] [10]. The most widely used techniques to tackle this kind of problem are under-sampling, over-sampling and the Synthetic Minority Over-sampling Technique (SMOTE) [2]. Under-sampling takes the number of instances m in the minority class and randomly selects m instances from the majority class. Over-sampling eliminates the imbalance by replicating instances of the minority class or by generating artificial instances from it. SMOTE over-samples the minority class by taking each minority class instance and generating synthetic instances along the line segments joining it to any or all of its k nearest minority class neighbors. It does not cause any information loss and can potentially uncover hidden minority regions.
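The SMOTE interpolation step described above can be sketched as follows. This is a minimal illustrative implementation, not the one used in [2] or in this paper; the function name `smote_sketch` and its parameters are our own, and a brute-force neighbor search is assumed for clarity.

```python
import math
import random

def smote_sketch(X_min, k=5, n_new=100, seed=0):
    """Generate n_new synthetic minority instances along the line segments
    joining each selected minority instance to one of its k nearest
    minority-class neighbors (the SMOTE idea)."""
    rnd = random.Random(seed)
    # k nearest minority neighbors of each minority instance (brute force)
    nn = []
    for i, x in enumerate(X_min):
        order = sorted((j for j in range(len(X_min)) if j != i),
                       key=lambda j: math.dist(x, X_min[j]))
        nn.append(order[:k])
    synthetic = []
    for _ in range(n_new):
        i = rnd.randrange(len(X_min))   # pick a minority instance
        j = rnd.choice(nn[i])           # and one of its k nearest neighbors
        gap = rnd.random()              # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(X_min[i], X_min[j])))
    return synthetic
```

Because every synthetic instance is a convex combination of two minority instances, the new points always lie inside the convex hull of the minority class, which is why SMOTE densifies minority regions without inventing instances outside them.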