Genetic Cooperative-Competitive Fuzzy Rule Based Learning Method using Genetic Programming for Highly Imbalanced Data-Sets

Alberto Fernández 1, Francisco J. Berlanga 2, María J. del Jesus 3, Francisco Herrera 1

1. Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
2. Department of Computer Science and Systems Engineering, University of Zaragoza, Zaragoza, Spain
3. Department of Computer Science, University of Jaén, Jaén, Spain
Email: alberto@decsai.ugr.es, berlanga@unizar.es, mjjesus@ujaen.es, herrera@decsai.ugr.es

Abstract: Classification in imbalanced domains is an important problem in Data Mining. We refer to imbalanced classification when the data present many examples from one class and few from the other class, and the less represented class is the one of greater interest from the point of view of the learning task. The aim of this work is to study the behaviour of the GP-COACH algorithm in the scenario of data-sets with high imbalance, analysing both the performance and the interpretability of the obtained fuzzy models. In the experimental study we compare this approach with a well-known fuzzy rule learning algorithm, Chi et al.’s method, and with a reference algorithm in the field of imbalanced data-sets, the C4.5 decision tree.

Keywords: Fuzzy Rule-Based Classification Systems, Genetic Fuzzy Systems, Genetic Programming, Imbalanced Data-Sets, Interpretability

1 Introduction

In the area of Data Mining, real world classification problems present some features that can diminish the accuracy of Machine Learning algorithms, such as the presence of noise or missing values, or the imbalanced distribution of classes. Specifically, the problem of imbalanced data-sets has been considered one of the emergent challenges in Data Mining [1].
This situation occurs when one class is represented by a large number of examples (known as the negative class), whereas the other is represented by only a few (the positive class).

Our objective is to develop an empirical analysis in the context of imbalanced classification for binary data-sets when the class imbalance ratio is high. In this study, we will make use of Fuzzy Rule Based Classification Systems (FRBCSs), a very useful tool in the field of Machine Learning, since they provide a very interpretable model for the end user [2].

We will employ a novel approach, GP-COACH (Genetic Programming-based evolutionary algorithm for the learning of COmpact and ACcurate FRBCS) [3], that learns disjunctive normal form (DNF) fuzzy rules (generated by means of a context-free grammar) and obtains very interpretable FRBCSs, with few rules and few conditions per rule, and with a high generalization capability.

We want to analyse whether this model is accurate for data-sets with high imbalance, compared with another FRBCS, Chi et al.’s approach [4], and with C4.5 [5], a decision tree algorithm that has been used as a reference in the imbalanced data-sets field [6, 7]. We will also focus on the tradeoff between accuracy and interpretability [8] of the final models obtained. We will employ the Area Under the Curve (AUC) metric [9] to compute the classification performance, whereas we will measure the interpretability of the system by means of the number of rules in the system.

We have selected a large collection of data-sets with high imbalance from the UCI repository [10] for developing our empirical analysis. In order to deal with the problem of imbalanced data-sets we will make use of a preprocessing technique, the “Synthetic Minority Over-sampling Technique” (SMOTE) [11], to balance the distribution of training examples in both classes.
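As a rough illustration of the oversampling idea mentioned above (it is not the experimental setup of this study), the following Python sketch computes a class imbalance ratio and creates synthetic positive examples by interpolating between a positive example and one of its nearest positive neighbours, which is the core idea of SMOTE [11]. The function names `imbalance_ratio` and `smote_like_oversample`, the definition of the ratio as negatives divided by positives, the Euclidean distance and the default k = 5 are assumptions made for this example.

```python
import math
import random


def imbalance_ratio(labels, positive):
    """Number of negative examples divided by number of positive ones
    (assumed definition of the imbalance ratio for this sketch)."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


def smote_like_oversample(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points from a list of minority-class tuples.

    Each synthetic point lies on the segment joining a randomly chosen
    minority example and one of its k nearest minority neighbours.
    Requires at least two minority examples.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base example (itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic
```

To balance a training set with, say, 90 negative and 10 positive examples (imbalance ratio 9), one would call `smote_like_oversample(positives, n_new=80)` and add the result to the positive class. Only the training partition is oversampled, as stated in the text; the test examples are left untouched.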
In this manner, we will analyse the positive synergy between the GP-COACH model and the SMOTE preprocessing technique for dealing with imbalanced data-sets. Furthermore, we will perform a statistical study using non-parametric tests [12, 13, 14] to find significant differences among the obtained results.

This contribution is organized as follows. First, Section 2 introduces the problem of imbalanced data-sets, describing its features, how to deal with this problem and the metric we have employed in this context. Next, in Section 3 we present the GP-COACH algorithm, explaining in detail the characteristics of this novel approach. Section 4 contains the experimental study for the GP-COACH, Chi et al.’s and C4.5 algorithms regarding performance and interpretability. Finally, Section 5 summarizes and concludes the work.

2 Imbalanced Data-Sets in Classification

Learning from imbalanced data is an important topic that has recently appeared in the Machine Learning community [15, 16, 17]. The significance of this problem lies in its presence in most real classification domains, such as fraud detection [18], risk management [19] and medical applications [20], among others.

This problem occurs when the number of instances of one class is much lower than the number of instances of the other classes. In this situation, the class of interest is often the one with the smaller number of examples, whereas the other class(es) represent(s) the counterpart of that concept and, in that manner, include(s) a large amount of data.

Standard classifier algorithms have a bias towards the majority class, since the rules that predict the larger number of
ISBN: 978-989-95079-6-8 IFSA-EUSFLAT 2009 42