Oversampling for Imbalanced Learning Based on K-Means and SMOTE

Felix Last 1,*, Georgios Douzas 1, and Fernando Bacao 1
1 NOVA Information Management School, Universidade Nova de Lisboa
* Corresponding author: mail@felixlast.de
Postal Address: NOVA Information Management School, Campus de Campolide, 1070-312 Lisboa, Portugal
Telephone: +351 21 382 8610

Abstract

Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language.

1 Introduction

The class imbalance problem in machine learning describes classification tasks in which classes of data are not equally represented. In many real-world applications, the nature of the problem implies a sometimes heavy skew in the class distribution of a binary or multi-class classification problem.
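The procedure outlined in the abstract — cluster the input space with k-means, keep clusters dominated by the minority class, and apply SMOTE within each retained cluster — can be sketched in a few dozen lines. The snippet below is a simplified, hypothetical illustration, not the authors' reference implementation: it uses a plain Lloyd's k-means with deterministic farthest-point initialisation, a fixed minority-share threshold for cluster filtering, and splits the oversampling quota evenly across eligible clusters, whereas the paper distributes it according to cluster sparsity.

```python
import numpy as np


def simple_kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with farthest-point initialisation (deterministic)."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def smote(X_min, n_new, k_neighbors, rng):
    """Standard SMOTE step: interpolate each synthetic point between a
    minority sample and one of its k nearest minority neighbours."""
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d, axis=1)[:, 1:k_neighbors + 1]  # skip self (column 0)
    out = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X_min))
        j = nn[i, rng.integers(nn.shape[1])]
        out[s] = X_min[i] + rng.random() * (X_min[j] - X_min[i])
    return out


def kmeans_smote(X, y, minority_label, k=3, minority_threshold=0.5,
                 k_neighbors=2, seed=0):
    rng = np.random.default_rng(seed)
    labels = simple_kmeans(X, k)
    n_needed = int((y != minority_label).sum() - (y == minority_label).sum())
    # Keep only clusters dominated by the minority class and large enough
    # for neighbour interpolation -- this is what avoids generating noise
    # in majority regions.
    eligible = [c for c in range(k)
                if (y[labels == c] == minority_label).mean() >= minority_threshold
                and (y[labels == c] == minority_label).sum() > k_neighbors]
    if not eligible or n_needed <= 0:
        return X, y
    # Even split of the quota (simplification; the paper weights by sparsity).
    quotas = [n_needed // len(eligible)] * len(eligible)
    quotas[0] += n_needed - sum(quotas)
    new_X = [smote(X[(labels == c) & (y == minority_label)], q, k_neighbors, rng)
             for c, q in zip(eligible, quotas) if q > 0]
    X_new = np.vstack([X] + new_X)
    y_new = np.concatenate([y, np.full(sum(len(a) for a in new_X), minority_label)])
    return X_new, y_new
```

Because clusters with a low minority share are filtered out before interpolation, synthetic points are only generated inside minority regions, in contrast to plain SMOTE, which may interpolate across majority territory between distant minority points.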
Such applications include fraud detection in banking, rare medical diagnoses, and oil spill recognition in satellite images, all of which naturally exhibit a minority class (Chawla et al., 2002; Kotsiantis et al., 2006, 2007; Galar et al., 2012).

The predictive capability of classification algorithms is impaired by class imbalance. Many such algorithms aim at maximizing classification accuracy, a measure which is biased towards the majority class. A classifier can achieve high classification accuracy even when it does not predict a single minority class instance correctly. For example, a trivial classifier which labels all credit card transactions as legitimate will achieve a classification accuracy of 99.9%, assuming that 0.1% of transactions are fraudulent; in this case, however, all fraud cases remain undetected. In conclusion, by optimizing classification accuracy, most algorithms assume a balanced class distribution (Provost, 2000; Kotsiantis et al., 2007).

Another inherent assumption of many classification algorithms is the uniformity of misclassification costs, which is rarely a characteristic of real-world problems. Typically in imbalanced datasets, misclassifying the minority class as the majority class has a higher cost associated with it than vice versa. An example of this is database marketing, where the cost of mailing to a non-respondent is much lower than the lost profit of not mailing to a respondent (Domingos, 1999).

Lastly, what is referred to as the "small disjuncts problem" is often encountered in imbalanced datasets (Galar et al., 2012). The problem refers to classification rules covering only a small number of training examples. The presence of only a few samples makes rule induction more susceptible to error (Holte et al., 1989). To illustrate the importance of discovering high quality rules for sparse areas of the input space, the example

arXiv:1711.00837v2 [cs.LG] 12 Dec 2017