Embedded Gene Selection for Imbalanced Microarray Data Analysis

Guo-Zheng Li
Department of Control Science & Engineering, Tongji University, Shanghai 201804, China
drgzli@gmail.com

Hao-Hua Meng
School of Computer Engineering & Science, Shanghai University, Shanghai 200072, China
mhhtj@shu.edu.cn

Jun Ni
Department of Radiology, University of Iowa, Iowa City, IA 522542, USA
jun-ni@uiowa.edu

Abstract

Most microarray data sets are imbalanced, i.e. the number of positive examples is much smaller than the number of negative examples, which hurts the performance of classifiers used for tumor classification. Although this problem is critical, few previous works have paid attention to it. Here we propose embedded gene selection with two algorithms, EGSEE (Embedded Gene Selection for EasyEnsemble) and EGSIEE (Embedded Gene Selection for Individuals of EasyEnsemble), to address this problem and improve the generalization performance of the EasyEnsemble classifier. Experimental results on several microarray data sets show that EGSEE and EGSIEE obtain better performance than two previous filter feature selection methods.

1 Introduction

Rapid advances in gene expression microarray technology make it possible to measure the expression levels of thousands or tens of thousands of genes simultaneously in a single experiment. Analysis of microarray data presents unprecedented opportunities and challenges for data mining in areas such as gene clustering, class discovery and pattern classification [4]. In pattern classification, a microarray data set is provided as a training set of labeled examples, and the task is to build a classifier that accurately predicts the classes of novel unlabeled examples. A typical data set has thousands of genes but only a small number of examples (often fewer than one hundred). The number of examples is likely to remain small, at least in the near future, because of the expense of collecting microarray examples [3].
The combination of relatively high dimensionality and small sample size in microarray data causes the well-known "curse of dimensionality" problem. Selecting a small number of discriminative genes from thousands of genes is therefore essential for successful pattern classification.

Gene selection, the process of choosing a subset of genes from the original ones, is frequently used as a preprocessing technique in the analysis of microarray data. It has proved effective in reducing dimensionality, improving mining efficiency, increasing mining accuracy and enhancing result comprehensibility [5]. In bioinformatics, the most commonly used gene selection procedures are based on a score that is calculated for each gene individually; the genes with the best scores are selected [16, 18]. Such procedures output a list of relevant genes that biologists may then analyze experimentally. This approach is often called univariate gene selection (a filter method), and its advantages are simplicity and interpretability. Embedded gene selection is another popular type of gene selection method. It was proposed recently by Guyon et al. [6, 5] and has lower complexity than wrapper gene selection. Because it depends on the classifier being used, it produces better performance for that classifier than filter gene selection does.

Although gene selection helps improve the performance of classifiers, the imbalance of microarray data sets hurts the performance of previously proposed methods. Little work has been done on imbalanced microarray data sets. Yang et al. [16] proposed two evaluation scores for gene selection in imbalanced microarray data, but their experiments used prediction accuracy to validate the methods. Since overall accuracy may fail to reflect the accuracy on the minority positive examples, their experimental results are not fully convincing.
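The point about accuracy can be made concrete with a small arithmetic illustration. The sketch below is not taken from the paper: the class sizes (10 positive, 90 negative) and the helper names `accuracy` and `recall` are assumptions chosen only for illustration. It shows a trivial classifier that always predicts the negative class reaching high overall accuracy while identifying no positive example.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of examples of class `label` that are predicted as `label`."""
    idx = [i for i, t in enumerate(y_true) if t == label]
    return sum(y_pred[i] == label for i in idx) / len(idx)


# Hypothetical imbalanced data set: 10 positive (1) and 90 negative (0) examples.
y_true = [1] * 10 + [0] * 90

# A trivial classifier that always predicts the majority (negative) class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)       # 90 of 100 predictions are correct
pos_recall = recall(y_true, y_pred, 1)  # no positive example is recognized
neg_recall = recall(y_true, y_pred, 0)  # every negative example is recognized

print(f"accuracy={acc:.2f} positive recall={pos_recall:.2f} "
      f"negative recall={neg_recall:.2f}")
```

Reporting the per-class values alongside overall accuracy exposes the failure on the minority class that accuracy alone hides.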
Many methods for handling imbalanced data sets have been proposed in the machine learning field [12, 19, 17]. Among them, the EasyEnsemble classifier proposed by Liu et al. [12] achieved satisfactory results compared with several state-of-the-art methods. Building on the EasyEnsemble classifier [12], we propose embedded gene selection that uses prediction risk [11] as the evaluation criterion for the analysis of imbalanced microarray data sets. Two algorithms, EGSEE (Embedded Gene Selection for EasyEnsemble) and EGSIEE (Embedded Gene Selection for Individuals of EasyEnsemble), are proposed to perform gene selection.
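As background, the following minimal sketch illustrates the undersampling idea commonly associated with EasyEnsemble [12], as we understand it: several balanced subsets are built by pairing all minority examples with an equally sized random sample of majority examples, and one classifier is then trained per subset. Only the subset construction is shown. The function name `balanced_subsets`, its parameters, and the toy class sizes are hypothetical, and the sketch does not include the classifier training, the ensemble combination, or the gene selection steps of EGSEE and EGSIEE.

```python
import random


def balanced_subsets(labels, n_subsets, seed=0):
    """Return index lists, each holding all minority examples plus an
    equally sized random sample (without replacement) of majority examples."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Treat the smaller class as the minority class.
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority, len(minority))
        subsets.append(sorted(minority + sampled))
    return subsets


# Hypothetical toy labels: 5 positive and 45 negative examples.
labels = [1] * 5 + [0] * 45
subsets = balanced_subsets(labels, n_subsets=4)

for k, subset in enumerate(subsets):
    n_pos = sum(labels[i] for i in subset)
    print(f"subset {k}: {len(subset)} examples, {n_pos} positive")
```

Each subset is balanced, so a classifier trained on it is not dominated by the majority class, while different subsets together cover more of the majority examples than a single undersampled set would.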