A Three-Stage Method to Select Informative Genes from Gene Expression Data in Classifying Cancer Classes

Mohd Saberi Mohamad 1,2 , Sigeru Omatu 1 , Safaai Deris 2 , Michifumi Yoshioka 1
1 Department of Computer Science and Intelligent Systems, Osaka Prefecture University, Sakai, Osaka 599-8531, Japan
mohd.saberi@sig.cs.osakafu-u.ac.jp, {omatu, yoshioka}@cs.osakafu-u.ac.jp
2 Department of Software Engineering, Universiti Teknologi Malaysia, 81310 Skudai, Johore, Malaysia
safaai@utm.my

Abstract: Gene selection for cancer classification faces a major problem due to properties of the data such as the small number of samples compared to the huge number of genes, irrelevant genes, and noisy data. Hence, this paper aims to select a near-optimal (small) subset of informative genes that is most relevant for cancer classification. To achieve this aim, a three-stage method is proposed: 1) pre-selecting genes using a filter method; 2) optimizing the gene subset using a multi-objective hybrid method; 3) analyzing the frequency of appearance of each gene. In experiments on three public gene expression data sets, the classification accuracies and the numbers of selected genes obtained by the proposed method are better than those of the other methods tested and of previous works. A list of informative genes in the final gene subsets is also presented for biological usage.

Keywords: cancer classification; genetic algorithm; gene selection; gene expression data; three-stage method

I. INTRODUCTION

Microarray technology is used to measure the expression levels of thousands of genes simultaneously, producing gene expression data. The gene expression levels of cancerous and normal tissues can then be compared. This comparison is useful for selecting genes that might anticipate the clinical behavior of cancers. Thus, there is a need to select informative genes that contribute to a cancerous state.
An informative gene is useful for cancer classification. However, the gene selection process poses a major challenge because of the following characteristics of gene expression data: the huge number of genes compared to the small number of samples (high-dimensional data), irrelevant genes, and noisy data. To overcome this challenge, a gene selection method is used to select a subset of genes for cancer classification. Gene selection has several advantages, such as maintaining or improving classification accuracy, reducing the dimensionality of the data, and removing irrelevant and noisy genes.

There are two types of gene selection methods [1]: if a gene selection method is carried out independently of a classifier, it belongs to the filter approach; otherwise, it is said to follow a hybrid (wrapper) approach. In the early era of microarray analysis, most works used the filter approach to select genes because it is computationally more efficient than the hybrid approach [2-3]. However, the filter approach results in the inclusion of irrelevant and noisy genes in the gene subset used for cancer classification. The hybrid approach usually provides greater accuracy than the filter approach. Several hybrid methods have been implemented to select informative genes, especially the combination of a genetic algorithm (GA) and a support vector machine (SVM) classifier (GASVM) [1],[4-8]. The hybrid (GASVM-based) methods in these previous works have two drawbacks [1],[4-8]: 1) they cannot efficiently produce a small subset of informative genes when the total number of genes is too large (high-dimensional data); 2) they carry a high risk of over-fitting. To solve the problems derived from gene expression data and to overcome the limitations of the hybrid methods in the previous works [1],[4-8], we propose a three-stage method (3-SGS) for gene selection.
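As a rough illustration of the difference between the filter and hybrid (wrapper) approaches, the following Python sketch scores genes in both ways. It is not the method proposed in this paper: the signal-to-noise-style filter score, the nearest-centroid classifier (a simple stand-in for the SVM), the leave-one-out evaluation, all function names, and the tiny data set are assumptions made only for illustration.

```python
from statistics import mean, pstdev


def filter_score(values, labels):
    """Filter approach: score one gene from its expression values alone,
    without any classifier (a signal-to-noise-style ratio, assumed here)."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    spread = pstdev(a) + pstdev(b)
    return abs(mean(a) - mean(b)) / (spread + 1e-12)


def nearest_centroid_predict(train_x, train_y, sample):
    """Toy classifier used as a stand-in for the SVM (an assumption)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((s - c) ** 2 for s, c in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def wrapper_score(data, labels, subset):
    """Wrapper approach: score a gene subset by the leave-one-out accuracy
    of a classifier that sees only the genes in that subset."""
    reduced = [[row[g] for g in subset] for row in data]
    correct = 0
    for i in range(len(reduced)):
        train_x = reduced[:i] + reduced[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, reduced[i]) == labels[i]:
            correct += 1
    return correct / len(reduced)


# Made-up data: 6 samples x 3 genes. Gene 0 separates the two classes;
# genes 1 and 2 have the same mean in both classes.
data = [
    [1.0, 5.0, 3.1],
    [1.2, 4.0, 2.9],
    [0.9, 6.0, 3.0],
    [3.0, 5.5, 3.0],
    [3.1, 4.5, 3.2],
    [2.9, 5.0, 2.8],
]
labels = [0, 0, 0, 1, 1, 1]

# Filter: one score per gene, computed independently of any classifier.
filter_scores = [filter_score([row[g] for row in data], labels) for g in range(3)]

# Wrapper: one score per candidate subset, computed through the classifier.
acc_gene0 = wrapper_score(data, labels, [0])
acc_gene1 = wrapper_score(data, labels, [1])
```

The filter score needs one pass over each gene, whereas the wrapper score retrains the classifier for every candidate subset, which is the source of the computational cost difference noted above.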
3-SGS is able to perform well on high-dimensional data and to reduce the high risk of over-fitting because it has three stages: stage 1 produces a subset of genes; stage 2 produces near-optimal subsets of genes; stage 3 yields a small (final) subset of informative genes based on the frequency of appearance of each gene in the near-optimal subsets. The diagnostic goal is to develop a medical procedure that detects diseases using the smallest possible number of genes. Thus, the ultimate goal of this paper is to select a small subset of informative genes (minimizing the number of selected genes) that yields high cancer classification accuracy (maximizing the classification accuracy). To achieve this goal, we adopt 3-SGS and evaluate it on three real gene expression data sets of tumor samples.

The outline of this paper is as follows: Sections 2 and 3 discuss previous works and the details of the proposed 3-SGS, respectively. In Section 4, the gene expression data sets, experimental setup, and experimental results are described. The conclusion of this paper is provided in Section 5.

II. PREVIOUS WORKS

Several hybrid methods, i.e., GASVM-based methods, have been proposed for gene selection of gene expression

2010 International Conference on Intelligent Systems, Modelling and Simulation 978-0-7695-3973-7/10 $26.00 © 2010 IEEE DOI 10.1109/ISMS.2010.39 158