A Novel Similarity Based Under-sampling of Imbalanced Datasets

Maira Anis, PhD Scholar, School of Management & Economics, University of Electronic Science & Technology of China, Chengdu, China. Email: maira7pk@hotmail.com
Mohsin Ali, PhD Scholar, School of Management & Economics, University of Electronic Science & Technology of China, Chengdu, China. Email: mohsinali757@gmail.com

Abstract— For decades, machine learning has faced the problem of imbalanced datasets, which arises when the number of patterns in one class significantly outnumbers the patterns of the other class. The class with the greater number of patterns is the majority class, while the other is called the minority class. This study investigates a new under-sampling approach that balances the dataset by eliminating the most similar majority instances using a new distance, the average locally centered Mahalanobis distance, denoted D². It is more insightful than traditional proximity measures such as the Euclidean measure. Results presented in this paper show that this approach outperforms other under-sampling techniques on evaluation metrics such as F-measure and G-mean.

Keywords: Novel Under-sampling, Mahalanobis Distance, Imbalanced Datasets

I. INTRODUCTION

One of the prime challenges faced by the machine learning community is the classification of imbalanced datasets [1, 2]. Imbalanced datasets occur frequently in real-world problems, e.g. fraud detection [3], oil spill detection [4] and medical diagnosis [5]. Such datasets are highly imbalanced in nature and have an unequal distribution of classes (here we consider binary classification). This unequal distribution arises when the number of instances belonging to one class is greatly outnumbered by the number of instances of the other class. This skewness of the data adversely affects classifier performance.
Ideally, for any machine learning experiment, a model achieves its best performance on a balanced class distribution. Machine learning methods assume an equal distribution of both classes; if these methods are applied to imbalanced datasets, they learn their rules mainly from the majority class. In such problems the classifier is overwhelmed by the majority (negative) class, which results in misclassification of the minority (positive) class. Consider, for example, a credit card fraud dataset. Usually the number of legitimate (negative) transactions is far greater than the number of fraudulent (positive) transactions. If we have 99 legitimate transactions and 1 fraudulent one, a classifier that labels every transaction as legitimate misclassifies the positive class yet still reports 99% classification accuracy (overall accuracy). For that reason, overall accuracy is a biased classification metric that favors the majority class.

Numerous solutions have been presented in the literature to alter the skewness of the data at the data level or the algorithmic level. This paper focuses on the data-level approach. Data-level approaches can be further categorized into under-sampling techniques, oversampling techniques, or a combination of both, called hybrid sampling techniques. In order to classify instances correctly, classifiers need to be trained on a balanced dataset, which can be achieved by either under-sampling or oversampling. In this paper we present a novel under-sampling approach that uses a new distance measure, namely the locally centered Mahalanobis distance, denoted D², to under-sample the negative class to the desired level. D² is more insightful than the Euclidean distance and the classical Mahalanobis distance. In this paper we raise the following questions:
1. How would this approach help in dealing with imbalanced datasets?
2. Would using D² maintain the distribution of the original data?
3. How does D² cope with the computational cost of under-sampling?
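The accuracy bias described above can be made concrete with a small sketch. The data below are hypothetical (the 99-legitimate/1-fraudulent example from the text), and the computation shows why F-measure and G-mean, unlike overall accuracy, expose a classifier that always predicts the majority class:

```python
import math

# Hypothetical credit-card data: 99 legitimate (0) and 1 fraudulent (1)
# transaction, mirroring the example in the text.
y_true = [0] * 99 + [1]

# A trivial classifier that always predicts the majority (legitimate) class.
y_pred = [0] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)           # 0.99 -- looks excellent
recall_pos = tp / (tp + fn)                  # 0.0  -- every fraud is missed
recall_neg = tn / (tn + fp)                  # 1.0
g_mean = math.sqrt(recall_pos * recall_neg)  # 0.0  -- exposes the failure
precision_pos = tp / (tp + fp) if (tp + fp) else 0.0
f_measure = (2 * precision_pos * recall_pos / (precision_pos + recall_pos)
             if (precision_pos + recall_pos) else 0.0)  # 0.0

print(accuracy, g_mean, f_measure)
```

Despite 99% accuracy, both G-mean and F-measure are zero, which is why these metrics are used for evaluation throughout this paper.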
D² is briefly explained in Section 3. The architecture of the proposed under-sampling is given in Figure 1. Traditional approaches usually use the Euclidean distance and eliminate either the borderline majority samples or the samples away from the boundary. The proposed under-sampling approach uses D² and works in two stages. First, it eliminates the majority samples near the boundary, which helps to shift the decision boundary towards the majority class, increases the decision space for the minority class, and reduces the misclassification rate of the minority class. Second, it eliminates the most redundant samples of the majority class using D². The D² used in this study takes a centralized approach, which makes it more effective in selecting the closest samples.

The remainder of the paper is organized as follows. Section 2 gives a brief review of existing under-sampling studies. Section 3 describes the motivation and our proposed approach in detail. Experimentation and the simulation results are presented in Section 4.

International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 1, January 2017, https://sites.google.com/site/ijcsis/, ISSN 1947-5500
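As a rough illustration of the second stage only — removing the most redundant majority samples by a Mahalanobis-type similarity — the sketch below uses the classical Mahalanobis distance to the (re-centered) majority mean as a stand-in; the function name, the ridge term, and the greedy removal loop are illustrative assumptions, not the exact D² formulation, which is defined in Section 3:

```python
import numpy as np

def mahalanobis_undersample(X_maj, n_keep):
    """Illustrative under-sampling sketch (NOT the paper's exact D^2):
    repeatedly drop the majority sample whose squared Mahalanobis distance
    to the current majority-class centre is smallest, i.e. the most
    'redundant' sample, until only n_keep samples remain."""
    X = np.asarray(X_maj, dtype=float)
    # Inverse covariance of the majority class; a small ridge term keeps
    # the matrix invertible when features are nearly collinear.
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    inv_cov = np.linalg.inv(cov)
    keep = list(range(len(X)))
    while len(keep) > n_keep:
        centre = X[keep].mean(axis=0)  # re-centre after each removal
        diffs = X[keep] - centre
        # Squared Mahalanobis distance of each remaining sample to the centre.
        d2 = np.einsum("ij,jk,ik->i", diffs, inv_cov, diffs)
        keep.pop(int(np.argmin(d2)))   # drop the most central/redundant sample
    return X[keep]

# Usage on synthetic 2-D majority data: reduce 50 samples to 10.
X_majority = np.random.default_rng(0).normal(0.0, 1.0, (50, 2))
X_reduced = mahalanobis_undersample(X_majority, n_keep=10)
print(X_reduced.shape)  # (10, 2)
```

Unlike a Euclidean criterion, the covariance-weighted distance accounts for feature scale and correlation, which is the intuition behind preferring a Mahalanobis-type measure here.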