Oversampling for Imbalanced Learning Based on K-Means and SMOTE

Felix Last 1,*, Georgios Douzas 1, and Fernando Bacao 1
1 NOVA Information Management School, Universidade Nova de Lisboa
* Corresponding author: mail@felixlast.de
Postal Address: NOVA Information Management School, Campus de Campolide, 1070-312 Lisboa, Portugal
Telephone: +351 21 382 8610

Abstract

Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language.

1 Introduction

The class imbalance problem in machine learning describes classification tasks in which classes of data are not equally represented. In many real-world applications, the nature of the problem implies a sometimes heavy skew in the class distribution of a binary or multi-class classification problem.
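The procedure outlined in the abstract — cluster the input space with k-means, keep clusters dominated by the minority class, and apply SMOTE within each retained cluster — can be sketched in a few dozen lines. The snippet below is a simplified, hypothetical illustration, not the authors' reference implementation: it uses a plain Lloyd's k-means with deterministic farthest-point initialisation, a fixed minority-share threshold for cluster filtering, and splits the oversampling quota evenly across eligible clusters, whereas the paper distributes it according to cluster sparsity.

```python
import numpy as np


def simple_kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with farthest-point initialisation (deterministic)."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def smote(X_min, n_new, k_neighbors, rng):
    """Standard SMOTE step: interpolate each synthetic point between a
    minority sample and one of its k nearest minority neighbours."""
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d, axis=1)[:, 1:k_neighbors + 1]  # skip self (column 0)
    out = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X_min))
        j = nn[i, rng.integers(nn.shape[1])]
        out[s] = X_min[i] + rng.random() * (X_min[j] - X_min[i])
    return out


def kmeans_smote(X, y, minority_label, k=3, minority_threshold=0.5,
                 k_neighbors=2, seed=0):
    rng = np.random.default_rng(seed)
    labels = simple_kmeans(X, k)
    n_needed = int((y != minority_label).sum() - (y == minority_label).sum())
    # Keep only clusters dominated by the minority class and large enough
    # for neighbour interpolation -- this is what avoids generating noise
    # in majority regions.
    eligible = [c for c in range(k)
                if (y[labels == c] == minority_label).mean() >= minority_threshold
                and (y[labels == c] == minority_label).sum() > k_neighbors]
    if not eligible or n_needed <= 0:
        return X, y
    # Even split of the quota (simplification; the paper weights by sparsity).
    quotas = [n_needed // len(eligible)] * len(eligible)
    quotas[0] += n_needed - sum(quotas)
    new_X = [smote(X[(labels == c) & (y == minority_label)], q, k_neighbors, rng)
             for c, q in zip(eligible, quotas) if q > 0]
    X_new = np.vstack([X] + new_X)
    y_new = np.concatenate([y, np.full(sum(len(a) for a in new_X), minority_label)])
    return X_new, y_new
```

Because clusters with a low minority share are filtered out before interpolation, synthetic points are only generated inside minority regions, in contrast to plain SMOTE, which may interpolate across majority territory between distant minority points.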
Such applications include fraud detection in banking, rare medical diagnoses, and oil spill recognition in satellite images, all of which naturally exhibit a minority class (Chawla et al., 2002; Kotsiantis et al., 2006, 2007; Galar et al., 2012).

The predictive capability of classification algorithms is impaired by class imbalance. Many such algorithms aim at maximizing classification accuracy, a measure which is biased towards the majority class. A classifier can achieve high classification accuracy even when it does not predict a single minority class instance correctly. For example, a trivial classifier which labels all credit card transactions as legitimate will achieve a classification accuracy of 99.9%, assuming that 0.1% of transactions are fraudulent; in this case, however, all fraud cases remain undetected. In conclusion, by optimizing classification accuracy, most algorithms assume a balanced class distribution (Provost, 2000; Kotsiantis et al., 2007).

Another inherent assumption of many classification algorithms is the uniformity of misclassification costs, which is rarely a characteristic of real-world problems. Typically in imbalanced datasets, misclassifying the minority class as the majority class has a higher cost associated with it than vice versa. An example of this is database marketing, where the cost of mailing to a non-respondent is much lower than the lost profit of not mailing to a respondent (Domingos, 1999).

Lastly, what is referred to as the "small disjuncts problem" is often encountered in imbalanced datasets (Galar et al., 2012). The problem refers to classification rules covering only a small number of training examples. The presence of only a few samples makes rule induction more susceptible to error (Holte et al., 1989). To illustrate the importance of discovering high quality rules for sparse areas of the input space, the example

arXiv:1711.00837v2 [cs.LG] 12 Dec 2017