On the Use of Variable Complementarity for Feature Selection in Cancer Classification

Patrick E. Meyer and Gianluca Bontempi

Université Libre de Bruxelles, (CP 212), 1050 Bruxelles, Belgique,
(pmeyer,gbonte)@ulb.ac.be
home page: http://ulb.ac.be/di/mlg/

Abstract. The paper presents an original filter approach for effective feature selection in classification tasks with a very large number of input variables. The approach is based on a new information-theoretic selection criterion: the double input symmetrical relevance (DISR). The rationale of the criterion is that a set of variables can provide more information about the output class than the sum of the information provided by each variable taken individually. This property is made explicit by defining a measure of variable complementarity. A feature selection filter based on the DISR criterion is compared, in theoretical and experimental terms, to recently proposed information-theoretic criteria. Experimental results on a set of eleven microarray classification tasks show that the proposed technique is competitive with existing filter selection methods.

1 Introduction

Statisticians and data miners are used to building predictive models and inferring dependencies between variables on the basis of observed data. However, in many emerging domains, such as bioinformatics, they face datasets characterized by a very large number of features (up to several thousand), a large amount of noise, non-linear dependencies and, often, only a few hundred samples. In this context, the detection of functional relationships and the design of effective classifiers appear to be major challenges. Recent technological advances, like microarray technology, have made it possible to simultaneously interrogate thousands of genes in a biological specimen.
It follows that two classification problems commonly encountered in bioinformatics are how to distinguish between tumor classes and how to predict the effects of medical treatments on the basis of microarray gene expression profiles. If we formalize this prediction task as a supervised classification problem, we realize that we are facing a problem where the number of input variables, represented by the number of genes, is huge (several thousand) and the number of samples, represented by the clinical trials, is very limited (a few tens). As a consequence, the use of classification techniques in bioinformatics requires the ability to handle datasets with many variables and few samples (also known as high feature-to-sample ratio datasets). Because of well-known numerical and statistical accuracy