THEORETICAL ADVANCES

On the k-NN performance in a challenging scenario of imbalance and overlapping

V. García · R. A. Mollineda · J. S. Sánchez

Received: 6 January 2007 / Accepted: 10 July 2007 / Published online: 28 September 2007
© Springer-Verlag London Limited 2007

Abstract  A two-class data set is said to be imbalanced when one (minority) class is heavily under-represented with respect to the other (majority) class. In the presence of significant overlapping, learning from imbalanced data can be a very difficult problem. Additionally, if the overall imbalance ratio differs from the local imbalance ratios in the overlap regions, the task can become a major challenge. This paper explains the behaviour of the k-nearest neighbour (k-NN) rule when learning from such a complex scenario. This local model is compared to other machine learning algorithms, attending to how their behaviour depends on a number of data complexity features (global imbalance, size of the overlap region, and its local imbalance). As a result, several conclusions useful for classifier design are inferred.

Keywords  Imbalanced data · Nearest neighbour rule · Class overlap · Local and global learning · Overall imbalance ratio · Local imbalance ratio

1 Introduction

The class imbalance problem has received considerable attention in areas such as machine learning and pattern recognition. A two-class data set is said to be imbalanced when one of the classes (the minority one) is heavily under-represented in comparison to the other class (the majority one). This issue is particularly important in real-world applications where it is costly to misclassify examples from the minority class, such as the diagnosis of rare diseases, the detection of fraudulent telephone calls, and insurance claims. Because examples of the minority and majority classes usually represent the presence and absence of rare cases, respectively, they are also known as positive and negative examples.
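To make the scenario concrete, the following sketch (illustrative only, not part of the original study; the data set and all names are ours) builds a toy imbalanced two-class sample in which the minority class lies entirely inside the majority class's range, and applies a plain k-NN majority vote. In such overlap regions the local class proportions, rather than the overall imbalance ratio, determine the prediction:

```python
import random
from collections import Counter

def knn_predict(train, query, k=5):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs; distances are squared
    Euclidean. This is the standard k-NN rule, with no correction for
    class imbalance.
    """
    neighbours = sorted(train, key=lambda p: (p[0][0] - query[0]) ** 2
                                           + (p[0][1] - query[1]) ** 2)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

random.seed(0)
# Globally imbalanced data: 200 negatives spread over [0, 10]^2 versus
# 20 positives confined to [4, 6]^2, so the classes fully overlap there.
neg = [((random.uniform(0, 10), random.uniform(0, 10)), 'neg') for _ in range(200)]
pos = [((random.uniform(4, 6), random.uniform(4, 6)), 'pos') for _ in range(20)]
train = neg + pos

# Far from the overlap region the global majority wins outright.
print(knn_predict(train, (0.5, 0.5), k=5))

# Inside the overlap region the outcome depends on the *local* class
# densities, which can differ sharply from the overall imbalance ratio.
print(knn_predict(train, (5.0, 5.0), k=5))
```

Note that here the minority class is locally denser inside [4, 6]^2 than the majority class, so queries in the overlap region may well be labelled positive despite the 10:1 global imbalance; this is exactly the divergence between overall and local imbalance ratios that the paper studies.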
The research in this topic has mainly focused on a number of solutions for learning from imbalanced data, which can be divided into three categories (that can also be combined):

1. Cost-sensitive learning [1–3].
2. Resampling the original training set, either by over-sampling the minority class and/or under-sampling the majority class, until the classes are approximately equally represented [4, 5].
3. Internally biasing the discrimination-based process so as to compensate for the class imbalance [6, 7].

Many other studies on the behaviour of standard classifiers in imbalanced domains have shown that the significant loss of performance is mainly due to the skew of the class distributions. However, recent investigations also suggest that other factors contribute to such performance degradation, for example, the size of the data set, the class imbalance level, small disjuncts, density, and overlap complexity [8–12]. With respect to the latter, results on C4.5 and fuzzy classifiers show that overlap affects performance more than imbalance does [11, 13]. It would be interesting to identify the degree of influence of each factor (and their interdependences) on the operation of each classifier.

V. García (✉)
Laboratorio de Reconocimiento de Patrones, Instituto Tecnológico de Toluca, Av. Tecnológico s/n, Metepec 52140, México
e-mail: vgarciaj@hotmail.com

R. A. Mollineda · J. S. Sánchez
Departament de Llenguatges i Sistemes Informàtics, Universitat Jaume I, Av. Vicent Sos Baynat s/n, 12071 Castelló, Spain

Pattern Anal Applic (2008) 11:269–280
DOI 10.1007/s10044-007-0087-5
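The resampling strategies in category 2 above can be sketched in a few lines. The following illustrative code (ours, not the paper's; function names are invented for this example) shows the two simplest variants, random under-sampling of the majority class and random over-sampling (duplication) of the minority class:

```python
import random

def random_undersample(majority, minority, seed=42):
    """Balance the classes by randomly discarding majority examples
    until both classes have the same size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def random_oversample(majority, minority, seed=42):
    """Balance the classes by randomly duplicating minority examples
    until both classes have the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

# A 10:1 imbalanced training set: 100 negatives vs 10 positives.
majority = [(x, 'neg') for x in range(100)]
minority = [(x, 'pos') for x in range(10)]

balanced_u = random_undersample(majority, minority)
balanced_o = random_oversample(majority, minority)
print(len(balanced_u), len(balanced_o))  # 20 200
```

Under-sampling discards potentially useful majority information, while over-sampling by duplication enlarges the training set without adding new information and can encourage overfitting; more elaborate schemes cited in [4, 5] address these trade-offs.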