Gene Ranking from Microarray Data for Cancer Classification–A Machine Learning Approach

Roberto Ruiz¹, Beatriz Pontes¹, Raúl Giráldez², and Jesús S. Aguilar-Ruiz²

¹ Department of Computer Science, University of Seville
Avenida Reina Mercedes s/n, 41012 Sevilla, Spain
{rruiz, bepontes}@lsi.us.es

² Area of Computer Science, University of Pablo de Olavide
Ctra. de Utrera, km. 1, 41013, Sevilla, Spain
{rgirroj, jsagurui}@upo.es

Abstract. Traditional gene selection methods often select the top-ranked genes according to their individual discriminative power. We propose applying feature evaluation measures that are widely used in the machine learning field but less common in the DNA microarray field. We also include sequential gene subset selection approaches. In our study, we apply several well-known criteria (filters and wrappers) to rank attributes, together with a greedy search procedure combined with three subset evaluation measures. Two completely different machine learning classifiers are applied to perform the class prediction. The comparison is performed on two well-known DNA microarray data sets. We observe that most of the top-ranked genes appear in the list of relevant-informative genes detected by previous studies on these data sets.

1 Introduction

Gene expression data are typically organized in microarrays. These are matrices in which columns represent genes and rows represent experimental conditions (henceforth, samples). Each element in the matrix refers to the expression level of a particular gene under a specific condition.

Analysis of microarray data presents unprecedented opportunities and challenges for data mining in areas such as gene clustering [1], sample clustering and class discovery [1,4], sample classification [4], and gene selection [6,9,16,18]. In this work, we address the gene selection issue under a classification framework.
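To make the matrix layout described above concrete, the following minimal sketch stores a toy expression matrix with samples as rows and genes as columns, and ranks genes by their individual discriminative power. The data values, the gene names, and the helper functions `signal_to_noise` and `rank_genes` are illustrative assumptions, and the signal-to-noise score used here is only one possible individual ranking criterion; it is not necessarily one of the measures evaluated in this study.

```python
from statistics import mean, stdev

# Toy expression matrix (made-up values): rows are samples, columns are genes.
expression = [
    [5.1, 0.9, 3.0],
    [4.8, 1.1, 2.0],
    [5.3, 1.0, 4.0],
    [1.2, 1.0, 3.1],
    [0.9, 0.8, 2.2],
    [1.1, 1.2, 3.9],
]
labels = [0, 0, 0, 1, 1, 1]                    # one class label per sample
gene_names = ["gene_a", "gene_b", "gene_c"]    # hypothetical gene identifiers


def signal_to_noise(values, labels):
    """Score one gene: |mean_0 - mean_1| / (std_0 + std_1) (assumed criterion)."""
    group0 = [v for v, y in zip(values, labels) if y == 0]
    group1 = [v for v, y in zip(values, labels) if y == 1]
    spread = stdev(group0) + stdev(group1)
    if spread == 0:
        return 0.0  # avoid division by zero for constant genes
    return abs(mean(group0) - mean(group1)) / spread


def rank_genes(expression, labels, gene_names):
    """Return gene names sorted from most to least discriminative."""
    columns = list(zip(*expression))  # transpose: one tuple of values per gene
    scores = {
        name: signal_to_noise(col, labels)
        for name, col in zip(gene_names, columns)
    }
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_genes(expression, labels, gene_names)
print(ranking)
```

In this toy data only `gene_a` separates the two classes clearly, so it is ranked first. Scoring each gene in isolation in this way ignores redundancy between genes, which is the limitation that subset selection approaches aim to address.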
⋆ This research was supported by the Spanish Research Agency CICYT under grants TIN2004–00159 and TIN2004-06689C0303.

The task is to build a classifier that accurately predicts the classes (diseases or phenotypes) of new unlabeled samples. A typical data set may contain thousands of genes but only a small number of samples (often fewer than two hundred). Theoretically, having more features should give us more discriminating power. However, this can cause several problems: increased computational complexity and cost; too many redundant or irrelevant genes; and degradation of the estimation of the classification error. In addition to reducing noise and improving