2003 Special Issue

An accelerated procedure for recursive feature ranking on microarray data

C. Furlanello*, M. Serafini, S. Merler, G. Jurman
ITC-irst, v. Sommarive 18, Povo, I-38050 Trento, Italy

Abstract

We describe a new wrapper algorithm for fast feature ranking in classification problems. The Entropy-based Recursive Feature Elimination (E-RFE) method eliminates chunks of uninteresting features according to the entropy of the weight distribution of an SVM classifier. With specific regard to DNA microarray datasets, the method is designed to support computationally intensive model selection in classification problems in which the number of features is much larger than the number of samples. We test E-RFE on synthetic and real data sets, comparing it with other SVM-based methods. The speed-up obtained with E-RFE supports predictive modeling on high-dimensional microarray data. © 2003 Elsevier Science Ltd. All rights reserved.

Keywords: Entropy-based recursive feature elimination; Microarray; Support Vector Machines

1. Introduction

For microarray data, class prediction generally refers to models of response to treatment with existing or new therapies, or to the detection of disease type or subtype (Chung, Bernard, & Perou, 2002; Slonim, 2002). Significant benefits to patients are expected from these efforts to develop new diagnostic tools, as well as from the elucidation of disease functional mechanisms at the molecular level. Microarrays, however, pose unusual challenges to machine learning algorithms, as gene expression data matrices measure thousands of genes (features) on only tens or hundreds of samples. Ranking genes according to their contribution to a model's predictive accuracy is thus crucial, not only as a machine learning problem but also from an application perspective.
Support Vector Machines (SVMs) are regarded as a high-performing classification method for gene-expression data and were soon coupled with feature selection procedures (Cristianini & Shawe-Taylor, 2000; Guyon, Weston, Barnhill, & Vapnik, 2002; Li, Campbell, & Tipping, 2002; Nguyen & Rocke, 2002; Weston et al., 2000; Xiong, Fang, & Zhao, 2001; Zhang & Wong, 2001). The procedure we discuss in this paper was motivated by a subtle methodological problem that has emerged in this field: initial studies on microarray data reported that very few genes sufficed to yield classification models with negligible or zero error rates. As discussed in Ambroise and McLachlan (2002), and confirmed by our study (Furlanello, Serafini, Merler, & Jurman, 2002), the feature-selection process has to be separated from the performance assessment; otherwise uncorrected estimates of the prediction error are obtained (the selection bias problem). Careful experimental schemes are thus required for gene ranking and selection procedures whenever they depend on the optimization of a classification rule. Error evaluation for models developed on a reduced set of interesting genes should operate out-of-sample, on data not involved in the selection process (Spang et al., 2001). Both processes also need to rely heavily on partition or resampling methods to smooth data variability. As a crucial consequence, model selection (including feature selection) requires intensive replication of the classification-and-ranking steps.

The entropy-based recursive feature elimination (E-RFE) procedure we analyse in this paper was designed specifically to accelerate the feature ranking step of predictive modeling in microarray data experiments. We discuss the feature selection properties of E-RFE.
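The selection bias problem can be demonstrated numerically: ranking genes on the full data set and then cross-validating only the classifier yields optimistic error estimates, whereas re-ranking inside each resampling fold does not. The following is a minimal sketch of that comparison, our own illustration rather than the paper's pipeline: it uses random labels, a simple correlation-based gene score, and a nearest-centroid classifier (all hypothetical stand-ins for the SVM-based ranking discussed here).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 40, 1000, 10
X = rng.standard_normal((n, p))           # "expression matrix": pure noise
y = rng.integers(0, 2, n) * 2 - 1         # random +/-1 labels: no true signal

def top_features(X, y, k):
    # rank features by absolute correlation-like score with the class labels
    score = np.abs((X * y[:, None]).mean(axis=0))
    return np.argsort(score)[-k:]

def centroid_predict(Xtr, ytr, Xte):
    # nearest-centroid classifier: assign each test point to the closer class mean
    m_pos = Xtr[ytr == 1].mean(axis=0)
    m_neg = Xtr[ytr == -1].mean(axis=0)
    d_pos = ((Xte - m_pos) ** 2).sum(axis=1)
    d_neg = ((Xte - m_neg) ** 2).sum(axis=1)
    return np.where(d_pos < d_neg, 1, -1)

def loo_error(X, y, select_inside):
    # leave-one-out error; feature selection either inside or outside the fold
    errs = []
    for i in range(len(y)):
        tr = np.arange(len(y)) != i
        if select_inside:
            feats = top_features(X[tr], y[tr], k)  # unbiased: test sample unseen
        else:
            feats = top_features(X, y, k)          # biased: test sample leaked in
        pred = centroid_predict(X[tr][:, feats], y[tr], X[[i]][:, feats])
        errs.append(pred[0] != y[i])
    return float(np.mean(errs))

biased = loo_error(X, y, select_inside=False)
unbiased = loo_error(X, y, select_inside=True)
```

On data with no true class structure the fold-internal estimate hovers near chance level, while the estimate with selection done on the full data comes out markedly lower, reproducing the effect described by Ambroise and McLachlan (2002).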
We also consider the speed-up of E-RFE with respect to the base method (RFE, described in Guyon et al., 2002) and its parametric accelerated variant, SQRT-RFE. Furthermore, on the basis of the experiments carried out in Guyon et al., 2002, we also show on synthetic data sets that

Neural Networks 16 (2003) 641–648
doi:10.1016/S0893-6080(03)00103-5
www.elsevier.com/locate/neunet

* Corresponding author. Tel.: +39-461-314592; fax: +39-461-302040. E-mail addresses: furlan@itc.it (C. Furlanello), mserafini@itc.it (M. Serafini), merler@itc.it (S. Merler), jurman@itc.it (G. Jurman).
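The E-RFE rule itself is defined later in the paper; as a rough illustration of entropy-driven chunk elimination, one recursion step can be sketched as follows. This is a loose reading of the idea, not the authors' exact criterion: the subgradient SVM trainer, the histogram-based entropy, and the threshold and chunk-size values are all our own simplifications. The step trains a linear classifier, measures the entropy of the weight-magnitude distribution, and then drops either a whole chunk of low-weight features or, conservatively, a single feature as in standard RFE.

```python
import numpy as np

def linear_svm_weights(X, y, lam=0.01, epochs=200, lr=0.1):
    # crude full-batch subgradient descent on the L2-regularised hinge loss;
    # a stand-in for a proper linear SVM solver
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margin = y * (X @ w)
        grad = lam * w - (X * y[:, None])[margin < 1].sum(axis=0) / len(y)
        w -= lr * grad
    return w

def weight_entropy(w, bins=10):
    # Shannon entropy of the histogram of |w|, normalised to [0, 1]
    h, _ = np.histogram(np.abs(w), bins=bins)
    p = h / h.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(bins))

def erfe_step(X, y, features, h_threshold=0.5, chunk_frac=0.5):
    # one elimination step: returns (kept, dropped) feature index arrays
    w = linear_svm_weights(X[:, features], y)
    order = np.argsort(np.abs(w))            # least important first
    if weight_entropy(w) > h_threshold:
        # weight distribution near-uniform: many features look equally
        # uninteresting, so eliminate a whole chunk at once
        n_drop = max(1, int(chunk_frac * len(features)))
    else:
        # weights concentrated on a few features: drop one, as in plain RFE
        n_drop = 1
    return features[order[n_drop:]], features[order[:n_drop]]

# tiny demo: one elimination step on synthetic data
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 50))
y = np.where(X[:, 0] > 0, 1, -1)             # class driven by feature 0
feats = np.arange(50)
kept, dropped = erfe_step(X, y, feats)
```

Recursing this step until one feature remains produces a full ranking; the chunk branch is what buys the speed-up over one-at-a-time RFE, since each eliminated chunk saves that many classifier retrainings.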