Borderline Over-sampling in Feature Space for Learning Algorithms in Imbalanced Data Environments

Kittipat Savetratanakaree, Member, IAENG, Kingkarn Sookhanaphibarn, Sarun Intakosum, and Ruck Thawonmas

Abstract—In this paper, we propose a new approach that over-samples new minority-class instances along the borderline, using the Euclidean distance in the feature space, to improve support vector machine (SVM) performance in imbalanced data environments. SVM has been an outstandingly successful classifier in a wide variety of applications where a balanced class distribution is assumed. However, SVM is ineffective when coping with imbalanced datasets, in which the majority-class instances far outnumber the minority-class instances. Our new approach, called Borderline Over-sampling in the Feature Space, can deal with imbalanced data to effectively recognize new minority-class instances for better classification with SVM. The results of our class prediction experiments using the proposed approach demonstrate better performance than the existing SMOTE, Borderline-SMOTE, and borderline over-sampling methods in terms of the g-mean and F-measure.

Index Terms—Borderline Over-sampling in the Feature Space, Imbalanced Dataset, Over-sampling, SVM in Imbalanced Data Environments

I. INTRODUCTION

Support vector machine (SVM) is an extremely successful classifier proposed by Vapnik [1] under the presumed condition of balanced data distributions among classes. However, SVM is ineffective when mining data with imbalanced classes. An imbalanced dataset is one in which the classes are not approximately equally represented.
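The borderline over-sampling family of methods, to which this paper contributes, rests on interpolating new minority-class instances between existing ones and their Euclidean nearest neighbors. The following minimal sketch illustrates that interpolation step in plain input space; the function names, the toy minority set, and the SMOTE-style interpolation shown here are ours for illustration, not the paper's feature-space variant.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def synthesize(instance, candidates, k=3, rng=random.Random(0)):
    """Create one synthetic minority instance by interpolating between
    `instance` and one of its k nearest minority-class neighbors."""
    nearest = sorted(candidates, key=lambda n: euclidean(instance, n))[:k]
    chosen = rng.choice(nearest)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [x + gap * (y - x) for x, y in zip(instance, chosen)]

# Toy minority-class instances (made up for illustration).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_point = synthesize(minority[0], minority[1:])
# new_point lies on a segment between minority[0] and one of its neighbors.
```

Each coordinate of the synthetic instance lies between the seed instance and the chosen neighbor, so new points stay inside the minority region rather than being blind duplicates.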
There are many applications in real-world domains that have innately imbalanced datasets, including fraudulent telephone call detection [2], oil spill detection in satellite images [3], telecommunications risk management [4], credit card fraud detection [5], IVF embryo implantation [6], class balancing in clinical datasets [7], text categorization, and unusual disease diagnosis [8]. By mining a large amount of balanced data, SVM classifiers can extract valuable knowledge for decision-making support and other objectives. Hidden valuable knowledge sometimes resides in minority-class instances. Minority-class instances are thus often more useful than majority-class instances and are also called positive instances; majority-class instances are also called negative instances. On imbalanced datasets, the positive instances are generally misclassified by SVM classifiers because they can be treated as noise. In some cases, the issue of class imbalance is critical and cannot be ignored. One example is the classification of pixels in mammogram images for possible breast cancer [9]. In this application, the majority class of normal pixels might contain 98% of the data, whereas the minority class of abnormal pixels may contain only 2%. If the machine learning algorithm ignores the abnormal pixels, patients' lives could be threatened.

Manuscript received February 12, 2016; revised May 30, 2016. K. Savetratanakaree and S. Intakosum are with the Computer Science Department, School of Science, King Mongkut's Institute of Technology, Ladkrabang, Bangkok, Thailand. K. Sookhanaphibarn is with the Computer Science and Software Engineering Department, School of Science and Technology, Bangkok University, Bangkok, Thailand. R. Thawonmas is with Ritsumeikan University, Shiga, Japan. Part of this work by the fourth author was supported in part by a Grant-in-Aid for Scientific Research (C), Number 26330421, JSPS. Corresponding author: K. Savetratanakaree, Email: kittipatsavet@gmail.com.
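The 98%/2% mammogram example above is exactly the setting in which plain accuracy misleads: a classifier that labels everything "normal" scores 98% accuracy while missing every abnormal pixel. The g-mean and F-measure used to evaluate the methods in this paper penalize that failure; a minimal sketch of both metrics from confusion-matrix counts follows (the counts are invented for illustration, with the minority class treated as positive):

```python
import math

def gmean_fmeasure(tp, fn, fp, tn):
    """g-mean and F-measure from confusion-matrix counts,
    treating the minority class as the positive class."""
    sensitivity = tp / (tp + fn)   # recall on the minority class
    specificity = tn / (tn + fp)   # recall on the majority class
    precision = tp / (tp + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return g_mean, f_measure

# Invented counts: 20 minority instances, 980 majority instances.
g, f = gmean_fmeasure(tp=15, fn=5, fp=10, tn=970)
# A classifier with no true positives at all would have g-mean 0,
# however high its raw accuracy.
```

Because the g-mean is a geometric mean of the per-class recalls, it collapses to zero whenever either class is entirely misclassified, which is why it is a common yardstick for imbalanced learning.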
Classification algorithms often perform worse in the detection of such unusual cases, which tend to be the most important ones. There are several methods [10] for overcoming the imbalanced class problem in SVM. The methods fall into two main groups. The first group comprises external methods: data preprocessing methods that adjust the class distribution of the datasets before training SVM classifiers. The second group comprises internal methods: algorithmic modifications to SVM that decrease its sensitivity to imbalanced classes. In this paper, we propose an over-sampling method called Borderline Over-sampling in the Feature Space (BOSFS), which belongs to the first group of data preprocessing methods. BOSFS conducts over-sampling by generating new synthetic minority-class instances from the nearest existing neighbors, focusing on the borderline in the feature space. These new synthetic instances are combined with the original imbalanced training dataset to form a new training dataset. SVM is trained using the new training dataset and then assessed on an independent testing dataset. With this new BOSFS method, the SVM classifier achieves higher recognition performance for the minority-class instances in the imbalanced testing dataset.

II. BACKGROUND AND RELATED WORK

In the external methods category, there are two different approaches: resampling and ensemble learning. First, resampling methods [11] consist of random or focused under- or over-sampling methods. These methods balance the minority-class and majority-class instances in the datasets before training SVM models. In the under-sampling approach, random instances of the majority class are removed until the datasets are balanced. In the over-sampling approach, the minority-class instances are randomly duplicated to achieve an approximately one-to-one ratio with the majority-class instances.
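The two random resampling schemes just described can be sketched in a few lines; this is a toy illustration with helper names of our choosing, not the paper's implementation:

```python
import random

def random_oversample(minority, majority, rng=random.Random(42)):
    """Randomly duplicate minority instances until the classes
    reach an approximately one-to-one ratio."""
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    return minority + extra

def random_undersample(minority, majority, rng=random.Random(42)):
    """Randomly remove majority instances until the classes balance."""
    return rng.sample(majority, len(minority))

# Toy imbalanced dataset: 2 minority vs. 5 majority instances.
minority = [[0.1, 0.2], [0.3, 0.1]]
majority = [[1.0, 1.1], [0.9, 1.2], [1.1, 0.8], [1.2, 1.0], [0.8, 0.9]]

balanced_min = random_oversample(minority, majority)  # now 5 instances
balanced_maj = random_undersample(minority, majority)  # now 2 instances
```

Random over-sampling risks overfitting to exact duplicates, and random under-sampling discards potentially informative majority instances; these weaknesses are what motivate synthetic-instance methods such as SMOTE and the borderline variants discussed next.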
Some research [12], [13], [14]

IAENG International Journal of Computer Science, 43:3, IJCS_43_3_12 (Advance online publication: 27 August 2016)