On the Use of Variable Complementarity for Feature Selection in Cancer Classification

Patrick E. Meyer and Gianluca Bontempi

Université Libre de Bruxelles, (CP 212), 1050 Bruxelles, Belgique,
(pmeyer,gbonte)@ulb.ac.be
home page: http://ulb.ac.be/di/mlg/

Abstract. The paper presents an original filter approach for effective feature selection in classification tasks with a very large number of input variables. The approach is based on a new information-theoretic selection criterion: the double input symmetrical relevance (DISR). The rationale of the criterion is that a set of variables can provide more information about the output class than the sum of the information provided by each variable taken individually. This property is made explicit by defining a measure of variable complementarity. A feature selection filter based on the DISR criterion is compared, in theoretical and experimental terms, to recently proposed information-theoretic criteria. Experimental results on a set of eleven microarray classification tasks show that the proposed technique is competitive with existing filter selection methods.

1 Introduction

Statisticians and data miners are used to building predictive models and inferring dependencies between variables on the basis of observed data. However, in many emerging domains, such as bioinformatics, they face datasets characterized by a very large number of features (up to several thousand), a large amount of noise, non-linear dependencies and, often, only a few hundred samples. In this context, the detection of functional relationships and the design of effective classifiers are major challenges. Recent technological advances, such as microarray technology, have made it possible to simultaneously interrogate thousands of genes in a biological specimen.
It follows that two classification problems commonly encountered in bioinformatics are how to distinguish between tumor classes and how to predict the effects of medical treatments on the basis of microarray gene expression profiles. If we formalize this prediction task as a supervised classification problem, we face a problem where the number of input variables, represented by the number of genes, is huge (several thousand) and the number of samples, represented by the clinical trials, is very limited (a few tens). As a consequence, the use of classification techniques in bioinformatics requires the capacity to manage datasets with many variables and few samples (also known as high feature-to-sample ratio datasets). Because of well-known numerical and statistical accuracy