2003 Special Issue

An accelerated procedure for recursive feature ranking on microarray data

C. Furlanello*, M. Serafini, S. Merler, G. Jurman
ITC-irst, v. Sommarive 18, Povo, I-38050 Trento, Italy

Abstract

We describe a new wrapper algorithm for fast feature ranking in classification problems. The Entropy-based Recursive Feature Elimination (E-RFE) method eliminates chunks of uninteresting features according to the entropy of the weight distribution of an SVM classifier. With specific regard to DNA microarray datasets, the method is designed to support computationally intensive model selection in classification problems in which the number of features is much larger than the number of samples. We test E-RFE on synthetic and real data sets, comparing it with other SVM-based methods. The speed-up obtained with E-RFE supports predictive modeling on high-dimensional microarray data. © 2003 Elsevier Science Ltd. All rights reserved.

Keywords: Entropy-based recursive feature elimination; Microarray; Support Vector Machines

1. Introduction

For microarray data, class prediction generally refers to models of response to treatment with existing or new therapies, or to the detection of disease type or subtype (Chung, Bernard, & Perou, 2002; Slonim, 2002). Significant benefits to patients are expected from these efforts to develop new diagnostic tools, as well as from the elucidation of disease functional mechanisms at the molecular level. Microarrays, however, pose unusual challenges to machine learning algorithms, as gene expression data matrices measure thousands of genes (features) on only tens or hundreds of samples. Ranking genes according to their contribution to a model's predictive accuracy is thus crucial, not only as a machine learning problem but also from an application perspective.
Support Vector Machines (SVMs) are regarded as a high-performing classification method for gene-expression data and were soon coupled with feature selection procedures (Cristianini & Shawe-Taylor, 2000; Guyon, Weston, Barnhill, & Vapnik, 2002; Li, Campbell, & Tipping, 2002; Nguyen & Rocke, 2002; Weston et al., 2000; Xiong, Fang, & Zhao, 2001; Zhang & Wong, 2001). The procedure we discuss in this paper was motivated by a subtle methodological problem that has emerged in this field: initial studies on microarray data reported that very few genes sufficed to yield classification models with negligible or zero error rates. As discussed in Ambroise and McLachlan (2002), and confirmed by our study (Furlanello, Serafini, Merler, & Jurman, 2002), the feature-selection process has to be separated from the performance assessment; otherwise uncorrected estimates of the prediction error are obtained (the selection bias problem). Careful experimental schemes are thus required for gene ranking and selection procedures whenever they depend on the optimization of a classification rule. Error evaluation for models developed on a reduced set of interesting genes should operate out-of-sample, on data not involved in the selection process (Spang et al., 2001). Both processes also need to rely heavily on partition or resampling methods to smooth data variability. As a crucial consequence, model selection (including feature selection) requires intensive replication of the classification-and-ranking steps.

The entropy-based recursive feature elimination (E-RFE) procedure we analyse in this paper was designed specifically to accelerate the feature ranking step of predictive modeling in microarray data experiments. We discuss the feature selection properties of E-RFE.
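The selection bias problem can be demonstrated numerically: ranking genes on the full data set and then cross-validating only the classifier yields optimistic error estimates, whereas re-ranking inside each resampling fold does not. The following is a minimal sketch of that comparison, our own illustration rather than the paper's pipeline: it uses random labels, a simple correlation-based gene score, and a nearest-centroid classifier (all hypothetical stand-ins for the SVM-based ranking discussed here).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 40, 1000, 10
X = rng.standard_normal((n, p))           # "expression matrix": pure noise
y = rng.integers(0, 2, n) * 2 - 1         # random +/-1 labels: no true signal

def top_features(X, y, k):
    # rank features by absolute correlation-like score with the class labels
    score = np.abs((X * y[:, None]).mean(axis=0))
    return np.argsort(score)[-k:]

def centroid_predict(Xtr, ytr, Xte):
    # nearest-centroid classifier: assign each test point to the closer class mean
    m_pos = Xtr[ytr == 1].mean(axis=0)
    m_neg = Xtr[ytr == -1].mean(axis=0)
    d_pos = ((Xte - m_pos) ** 2).sum(axis=1)
    d_neg = ((Xte - m_neg) ** 2).sum(axis=1)
    return np.where(d_pos < d_neg, 1, -1)

def loo_error(X, y, select_inside):
    # leave-one-out error; feature selection either inside or outside the fold
    errs = []
    for i in range(len(y)):
        tr = np.arange(len(y)) != i
        if select_inside:
            feats = top_features(X[tr], y[tr], k)  # unbiased: test sample unseen
        else:
            feats = top_features(X, y, k)          # biased: test sample leaked in
        pred = centroid_predict(X[tr][:, feats], y[tr], X[[i]][:, feats])
        errs.append(pred[0] != y[i])
    return float(np.mean(errs))

biased = loo_error(X, y, select_inside=False)
unbiased = loo_error(X, y, select_inside=True)
```

On data with no true class structure the fold-internal estimate hovers near chance level, while the estimate with selection done on the full data comes out markedly lower, reproducing the effect described by Ambroise and McLachlan (2002).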
We also consider the speed-up of E-RFE with respect to the base method (RFE, described in Guyon et al., 2002) and its parametric accelerated variant, SQRT-RFE. Furthermore, on the basis of the experiments carried out in Guyon et al., 2002, we also show on synthetic data sets that

Neural Networks 16 (2003) 641–648
doi:10.1016/S0893-6080(03)00103-5
www.elsevier.com/locate/neunet

* Corresponding author. Tel.: +39-461-314592; fax: +39-461-302040. E-mail addresses: furlan@itc.it (C. Furlanello), mserafini@itc.it (M. Serafini), merler@itc.it (S. Merler), jurman@itc.it (G. Jurman).
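The E-RFE rule itself is defined later in the paper; as a rough illustration of entropy-driven chunk elimination, one recursion step can be sketched as follows. This is a loose reading of the idea, not the authors' exact criterion: the subgradient SVM trainer, the histogram-based entropy, and the threshold and chunk-size values are all our own simplifications. The step trains a linear classifier, measures the entropy of the weight-magnitude distribution, and then drops either a whole chunk of low-weight features or, conservatively, a single feature as in standard RFE.

```python
import numpy as np

def linear_svm_weights(X, y, lam=0.01, epochs=200, lr=0.1):
    # crude full-batch subgradient descent on the L2-regularised hinge loss;
    # a stand-in for a proper linear SVM solver
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margin = y * (X @ w)
        grad = lam * w - (X * y[:, None])[margin < 1].sum(axis=0) / len(y)
        w -= lr * grad
    return w

def weight_entropy(w, bins=10):
    # Shannon entropy of the histogram of |w|, normalised to [0, 1]
    h, _ = np.histogram(np.abs(w), bins=bins)
    p = h / h.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(bins))

def erfe_step(X, y, features, h_threshold=0.5, chunk_frac=0.5):
    # one elimination step: returns (kept, dropped) feature index arrays
    w = linear_svm_weights(X[:, features], y)
    order = np.argsort(np.abs(w))            # least important first
    if weight_entropy(w) > h_threshold:
        # weight distribution near-uniform: many features look equally
        # uninteresting, so eliminate a whole chunk at once
        n_drop = max(1, int(chunk_frac * len(features)))
    else:
        # weights concentrated on a few features: drop one, as in plain RFE
        n_drop = 1
    return features[order[n_drop:]], features[order[:n_drop]]

# tiny demo: one elimination step on synthetic data
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 50))
y = np.where(X[:, 0] > 0, 1, -1)             # class driven by feature 0
feats = np.arange(50)
kept, dropped = erfe_step(X, y, feats)
```

Recursing this step until one feature remains produces a full ranking; the chunk branch is what buys the speed-up over one-at-a-time RFE, since each eliminated chunk saves that many classifier retrainings.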