Class-Dependent Resampling for Medical Applications

R.M. VALDOVINOS 1 and J.S. SÁNCHEZ 2
1 Lab. Reconocimiento de Patrones, Instituto Tecnológico de Toluca
Av. Tecnológico s/n, 52140 Metepec, Edo. de México, MÉXICO
li_rmvr@hotmail.com
2 Dept. Llenguatges i Sistemes Informàtics, Universitat Jaume I
Av. Sos Baynat s/n, 12071 Castelló de la Plana, SPAIN
sanchez@uji.es

Abstract: - Bagging, AdaBoost and Arc-x4 are among the most popular methods for building classifier ensembles. All of these methods rely on resampling techniques to generate a different training subsample for each of the base classifiers that constitute the ensemble. In the present work, the classical implementations of these algorithms are modified so that resampling is performed separately over the training instances of each class, thus giving each subsample the same class distribution as the original training set. We also introduce further modifications concerning the size of the subsamples and the voting strategy. Experimental results for medical and non-medical databases are presented, and the potential benefits of the proposed methods for diagnosis are discussed.

Key-Words: - Classifier ensemble, Resampling, Bagging, AdaBoost, Arc-x4, Class distribution, Medical databases

1 Introduction

One approach to classification tasks consists of using machine learning techniques to derive classifiers from training instances (patterns with known values of all the attributes). In this context, combining several base classifiers into an ensemble has generally succeeded in reducing classification error. An ensemble is a learning paradigm in which multiple base classifiers make their own predictions, which are then combined to produce a single classification result.
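As a toy illustration of this combination step, the following sketch trains nothing and simply shows how the predictions of several base classifiers can be fused by plurality voting (the labels and function name are hypothetical, not taken from the paper):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the labels predicted by the base classifiers for one
    instance into a single decision by plurality voting."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical example: three base classifiers predict a label for the
# same instance; the ensemble outputs the most frequent label.
base_predictions = ["healthy", "sick", "healthy"]
print(majority_vote(base_predictions))  # -> healthy
```

More elaborate fusion rules (e.g. weighted voting) follow the same pattern, replacing the simple count with per-classifier weights.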
Since an ensemble is often more accurate than its individual classifiers [3, 7, 8, 14, 17], this paradigm has become a hot topic in recent years and has already been successfully applied to optical character recognition, stock forecasting, image analysis, face recognition, medical diagnosis, etc.

In general, an ensemble is built in two steps: training multiple individual classifiers and then combining their predictions. According to the style of training the base classifiers, current ensemble algorithms can be roughly divided into two groups: algorithms where the base classifiers must be trained sequentially, and algorithms where they can be trained in parallel. The most popular example of the first category is AdaBoost [10], which sequentially generates a series of individual classifiers in which the training instances wrongly predicted by one component play a more important role in the training of its successor. Other representatives of this category include Arc-x4 [4], MultiBoost [16], LogitBoost [12], etc. The representative of the second category is Bagging [3], which uses bootstrap sampling to generate multiple training sets from the original training set and then trains one classifier from each of them. Other examples of this group include SEQUEL [1], Wagging [2], p-Bagging [2], etc.

In the present work, we concentrate on Bagging, AdaBoost and Arc-x4 because they have been extensively analyzed and have proved accurate in a variety of problems. These algorithms are here modified by performing a form of class-dependent resampling over the original training set. The ultimate aim is to give each training subsample the same class distribution as the original data set. In addition, we propose other modifications that pursue a reduction in the computational complexity of the classifier ensembles.

The rest of the paper is organized as follows.
Section 2 provides a brief description of the algorithms employed here: Bagging, AdaBoost, and Arc-x4. The modifications proposed in this paper are introduced in Section 3. Next, the experimental results are discussed in Section 4. Finally, Section 5 gives the main conclusions and points out possible directions for future research.

2 Resampling Methods

Bagging [3], AdaBoost [10], and Arc-x4 [4] are all very popular resampling methods utilized for