Balancing Strategies and Class Overlapping ⋆

Gustavo E. A. P. A. Batista 1,2, Ronaldo C. Prati 1, and Maria C. Monard 1

1 Institute of Mathematics and Computer Science at University of São Paulo
P. O. Box 668, ZIP Code 13560-970, São Carlos (SP), Brazil
2 Faculty of Computer Engineering at Pontifical Catholic University of Campinas
Rodovia D. Pedro I, Km 136, ZIP Code 13086-900, Campinas (SP), Brazil
{gbatista,prati,mcmonard} at icmc usp br

Abstract. Several studies have pointed out that class imbalance is a bottleneck in the performance achieved by standard supervised learning systems. However, a complete understanding of how this problem affects learning performance is still lacking. In previous work we identified that performance degradation is not caused solely by class imbalance, but is also related to the degree of class overlapping. In this work, we take this research a step further by investigating sampling strategies which aim to balance the training set. Our results show that these sampling strategies usually lead to a performance improvement for highly imbalanced data sets having highly overlapped classes. In addition, over-sampling methods seem to outperform under-sampling methods.

1 Introduction

Supervised Machine Learning (ML) systems aim to automatically create a classification model from a set of labeled training examples. Once the model is created, it can be used to automatically predict the class label of unlabeled examples. In many real-world applications, it is common to have a huge intrinsic disproportion in the number of examples in each class. This is known as the class imbalance problem, and it occurs whenever examples of one class heavily outnumber examples of the other class 3. Generally, the minority class represents a circumscribed concept, while the other class represents the counterpart of that concept.
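To make the idea of balancing a training set concrete, the following is a minimal sketch of random over-sampling and random under-sampling on a toy two-class data set. The function names, the toy data, and the 95/5 class split are assumptions made for illustration only. Random sampling is used here as a generic baseline, and it is not necessarily one of the specific methods evaluated in this work.

```python
import random
from collections import Counter


def random_oversample(examples, labels, seed=0):
    """Duplicate randomly chosen examples of the smaller classes until every
    class has as many examples as the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(examples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(examples, labels) if y == cls]
        # Draw with replacement so small classes can grow to the target size.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_x.extend(extra)
        out_y.extend([cls] * len(extra))
    return out_x, out_y


def random_undersample(examples, labels, seed=0):
    """Keep a random subset of each class so that every class has as many
    examples as the smallest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_x, out_y = [], []
    for cls in counts:
        pool = [x for x, y in zip(examples, labels) if y == cls]
        # Draw without replacement: examples are discarded, never duplicated.
        kept = rng.sample(pool, target)
        out_x.extend(kept)
        out_y.extend([cls] * target)
    return out_x, out_y


# Toy imbalanced data set (assumed): 5 positive (minority), 95 negative.
examples = [[float(i)] for i in range(100)]
labels = ["positive"] * 5 + ["negative"] * 95

over_x, over_y = random_oversample(examples, labels)
under_x, under_y = random_undersample(examples, labels)

print("original:     ", dict(Counter(labels)))
print("over-sampled: ", dict(Counter(over_y)))
print("under-sampled:", dict(Counter(under_y)))
```

Over-sampling keeps every original example at the cost of duplicates, while under-sampling discards majority-class examples. Both change only the training set, leaving the learning algorithm itself untouched.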
Several studies have pointed out that domains with a high class imbalance might cause a significant bottleneck in the performance achieved by standard ML systems. Even though class imbalance is a problem of great importance in ML, a complete understanding of how this problem affects the performance

⋆ This research is partly supported by Brazilian Research Councils CAPES and FAPESP.

3 Although in this work we deal with two-class problems, this discussion also applies to multi-class problems. Furthermore, positive and negative labels are used to denote the minority and majority classes, respectively.