Embedded Gene Selection for Imbalanced Microarray Data Analysis

Guo-Zheng Li
Department of Control Science & Engineering, Tongji University, Shanghai 201804, China
drgzli@gmail.com

Hao-Hua Meng
School of Computer Engineering & Science, Shanghai University, Shanghai 200072, China
mhhtj@shu.edu.cn

Jun Ni
Department of Radiology, University of Iowa, Iowa City, IA 522542, USA
jun-ni@uiowa.edu

Abstract

Most microarray data sets are imbalanced, i.e. the number of positive examples is much smaller than the number of negative examples, which hurts the performance of classifiers used for tumor classification. Although this problem is critical, few previous works have paid attention to it. Here we propose embedded gene selection with two algorithms, EGSEE (Embedded Gene Selection for EasyEnsemble) and EGSIEE (Embedded Gene Selection for Individuals of EasyEnsemble), to address this problem and improve the generalization performance of the EasyEnsemble classifier. Experimental results on several microarray data sets show that EGSEE and EGSIEE obtain better performance than the two previous filter feature selection methods.

1 Introduction

The rapid advances in gene expression microarray technology enable the expression levels of thousands or tens of thousands of genes to be measured simultaneously in a single experiment. Analysis of microarray data presents unprecedented opportunities and challenges for data mining in areas such as gene clustering, class discovery and pattern classification [4]. In pattern classification, a microarray data set is provided as a training set of labeled examples. The task is to build a classifier that accurately predicts the classes of novel unlabeled examples. A typical data set has thousands of genes but only a small number of examples (often fewer than one hundred). The number of examples is likely to remain small, at least for the near future, due to the expense of collecting microarray examples [3].
The relatively high dimensionality and small sample size of microarray data cause the well-known "curse of dimensionality" problem. Therefore, selecting a small number of discriminative genes from thousands of genes is essential for successful pattern classification.

Gene selection, the process of choosing a subset of genes from the original ones, is frequently used as a preprocessing technique in the analysis of microarray data. It has proven effective in reducing dimensionality, improving mining efficiency, increasing mining accuracy and enhancing the comprehensibility of results [5]. In the field of bioinformatics, the most commonly used gene selection procedures are based on a score that is calculated for each gene individually; the genes with the best scores are selected [16, 18]. Gene selection procedures output a list of relevant genes, which may then be analyzed experimentally by biologists. This approach is often called univariate gene selection (filter methods), and its advantages are simplicity and interpretability. Embedded gene selection, proposed recently by Guyon et al. [6, 5], is another popular type of gene selection method and has lower complexity than wrapper gene selection. Because it depends on the classifier used, it produces better performance for that classifier than filter gene selection does.

Although gene selection helps improve the performance of classifiers, the imbalance of microarray data sets hurts the performance of previously proposed methods. Little work has been done on imbalanced microarray data sets. Yang et al. [16] proposed two evaluation scores for gene selection in imbalanced microarray data, but their experiments used prediction accuracy to validate the methods. Since overall accuracy may fail to reflect how well the minority positive examples are classified, their experimental results are not fully convincing.
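The weakness of accuracy on imbalanced data can be illustrated with a small worked example. The sketch below is not taken from the paper: the class sizes (5 positive, 95 negative) and the always-negative classifier are hypothetical, chosen only to show how a high accuracy can coexist with zero recall on the positive class.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of examples of class `label` that were predicted as `label`."""
    total = sum(1 for t in y_true if t == label)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / total if total else 0.0


# Hypothetical imbalanced data set: 5 positive (tumor) and 95 negative examples.
y_true = [1] * 5 + [0] * 95

# A degenerate classifier that always predicts the majority (negative) class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
pos_recall = recall(y_true, y_pred, label=1)
neg_recall = recall(y_true, y_pred, label=0)

# Mean of the per-class recalls; one common imbalance-aware summary.
balanced = (pos_recall + neg_recall) / 2

print(f"accuracy={acc:.2f} positive recall={pos_recall:.2f} balanced={balanced:.2f}")
```

Here the classifier never identifies a single positive example, yet its accuracy is 95%. The mean of the per-class recalls, used in this sketch as one possible imbalance-aware measure, drops to 0.5.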
Many methods for handling imbalanced data sets have been proposed in the machine learning field [12, 19, 17]; among them, the EasyEnsemble classifier proposed by Liu et al. [12] achieved satisfactory results compared with several state-of-the-art methods. In combination with the EasyEnsemble classifier [12], we propose embedded gene selection using the prediction risk evaluation criterion [11] for the analysis of imbalanced microarray data sets. Two algorithms, EGSEE (Embedded Gene Selection for EasyEnsemble) and EGSIEE (Embedded Gene Selection for Individuals of EasyEnsemble), are proposed to perform gene selection.
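As a rough illustration of the kind of undersampling that EasyEnsemble-style methods build on, the sketch below draws several balanced training subsets: each keeps every positive example and adds an equally sized random sample of negative examples. It is a minimal sketch under stated assumptions, not the procedure of [12] or of EGSEE/EGSIEE. The function name `balanced_subsets`, its parameters and the toy data are hypothetical, and the step of training a classifier on each subset and combining the results is omitted.

```python
import random


def balanced_subsets(X, y, n_subsets=4, positive=1, seed=0):
    """Return `n_subsets` balanced (X_sub, y_sub) pairs.

    Assumes the positive class is the minority, so there are at least
    as many negative examples as positive ones.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, label in enumerate(y) if label == positive]
    neg_idx = [i for i, label in enumerate(y) if label != positive]

    subsets = []
    for _ in range(n_subsets):
        # Sample negatives without replacement within one subset;
        # different subsets are drawn independently and may overlap.
        sampled_neg = rng.sample(neg_idx, len(pos_idx))
        idx = pos_idx + sampled_neg
        subsets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsets


# Hypothetical toy data: 60 examples with 2 features, 6 of them positive.
X = [[float(i), float(i % 7)] for i in range(60)]
y = [1 if i < 6 else 0 for i in range(60)]

subsets = balanced_subsets(X, y, n_subsets=4)
for X_sub, y_sub in subsets:
    print(len(X_sub), "examples,", sum(y_sub), "positive")
```

Each subset is balanced, so a classifier trained on it is not dominated by the negative class, while drawing several subsets lets the ensemble see more of the negative examples than any single undersampled set would.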