A Three-Stage Method to Select Informative Genes from Gene Expression Data in
Classifying Cancer Classes
Mohd Saberi Mohamad (1,2), Sigeru Omatu (1), Safaai Deris (2), Michifumi Yoshioka (1)
(1) Department of Computer Science and Intelligent Systems, Osaka Prefecture University,
Sakai, Osaka 599-8531, Japan
mohd.saberi@sig.cs.osakafu-u.ac.jp, {omatu, yoshioka}@cs.osakafu-u.ac.jp
(2) Department of Software Engineering, Universiti Teknologi Malaysia,
81310 Skudai, Johore, Malaysia
safaai@utm.my
Abstract—Gene selection for cancer classification faces a
major problem due to the properties of the data: the small
number of samples compared to the huge number of genes,
irrelevant genes, and noisy data. Hence, this paper aims to
select a near-optimal (small) subset of informative genes that
is most relevant for cancer classification. To achieve this aim,
a three-stage method is proposed: 1) pre-selecting genes
using a filter method; 2) optimizing the gene subset using a
multi-objective hybrid method; 3) analyzing the frequency of
appearance of each gene. In experiments on three public gene
expression data sets, the proposed method outperforms the
other tested methods and previous works in both
classification accuracy and the number of selected genes. A
list of informative genes in the final gene subsets is also
presented for biological usage.
Keywords: cancer classification; genetic algorithm; gene
selection; gene expression data; three-stage method
I. INTRODUCTION
Microarray technology measures the expression levels of
thousands of genes simultaneously and produces gene
expression data. It also allows the gene expression levels of
cancerous and normal tissues to be compared. This
comparison is useful for selecting genes that might predict
the clinical behavior of cancers. Thus, there is a need to
select informative genes that contribute to a cancerous state;
such genes are useful for cancer classification. However, the
gene selection process poses a major challenge because of
the following characteristics of gene expression data: the
huge number of genes compared to the small number of
samples (high-dimensional data), irrelevant genes, and noisy
data.
To overcome the challenge, a gene selection method is
used to select a subset of genes for cancer classification.
The gene selection method has several advantages such as
maintaining or improving classification accuracy, reducing
the dimensionality of data, and removing irrelevant and
noisy genes.
There are two types of gene selection methods [1]: if a
gene selection method is carried out independently of a
classifier, it belongs to the filter approach; otherwise, it is
said to follow a hybrid (wrapper) approach. In the early era
of microarray analysis, most works used the filter approach
to select genes because it is computationally more efficient
than the hybrid approach [2-3]. However, the filter approach
results in the inclusion of irrelevant and noisy genes in the
gene subset used for cancer classification. The hybrid
approach usually provides greater accuracy than the filter
approach. To date, several hybrid methods, especially
combinations of a genetic algorithm (GA) and a support
vector machine (SVM) classifier (GASVM), have been
implemented to select informative genes [1],[4-8]. The
GASVM-based hybrid methods in these previous works
have two drawbacks [1],[4-8]: 1) they cannot efficiently
produce a small subset of informative genes when the total
number of genes is too large (high-dimensional data); 2) they
carry a high risk of over-fitting.
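The distinction between the two approaches can be illustrated with a minimal Python sketch. It is not the method of this paper or of the cited works: the function names filter_score and wrapper_score are hypothetical, the filter score (difference of class means divided by the sum of class standard deviations) is an assumed example of a classifier-independent criterion, and a nearest-centroid rule with leave-one-out evaluation stands in for the SVM classifier used in GASVM-based methods.

```python
from statistics import mean, pstdev

def filter_score(values, labels):
    """Filter approach: score ONE gene without any classifier.
    Assumed criterion: |mean_0 - mean_1| / (sd_0 + sd_1)."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    return abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-12)

def wrapper_score(X, labels, subset):
    """Hybrid (wrapper) approach: score a gene SUBSET by the
    leave-one-out accuracy of a classifier trained on it.
    A nearest-centroid rule is used here as a stand-in for an SVM."""
    n, correct = len(X), 0
    for i in range(n):
        train = [j for j in range(n) if j != i]
        centroids = {}
        for c in set(labels[j] for j in train):
            rows = [X[j] for j in train if labels[j] == c]
            centroids[c] = [mean(r[g] for r in rows) for g in subset]
        sample = [X[i][g] for g in subset]
        pred = min(
            centroids,
            key=lambda c: sum((s - m) ** 2 for s, m in zip(sample, centroids[c])),
        )
        correct += pred == labels[i]
    return correct / n

# Toy data (illustrative only): 4 samples x 3 genes, two classes.
X = [
    [1.0, 2.0, 0.5],
    [1.2, 2.1, 3.0],
    [3.0, 2.0, 0.6],
    [3.1, 2.1, 2.9],
]
labels = [0, 0, 1, 1]

gene0 = [row[0] for row in X]  # separates the two classes
gene1 = [row[1] for row in X]  # carries no class information
print(filter_score(gene0, labels), filter_score(gene1, labels))
print(wrapper_score(X, labels, [0]), wrapper_score(X, labels, [1]))
```

The filter score is computed once per gene and is therefore cheap, whereas the wrapper score retrains a classifier for every candidate subset, which is why the hybrid approach becomes costly when the number of genes is very large.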
To solve the problems arising from gene expression data
and overcome the limitations of the hybrid methods in the
previous works [1],[4-8], we propose a three-stage method
(3-SGS) for gene selection. This method is able to perform
well on high-dimensional data and reduce the risk of
over-fitting because it has three stages: stage 1 produces a
subset of genes; stage 2 produces near-optimal subsets of
genes; stage 3 yields a small (final) subset of informative
genes based on the frequency of appearance of each gene in
the near-optimal subsets. The diagnostic goal is to develop a
medical procedure that detects diseases using the smallest
possible number of genes. Thus, the ultimate goal of this
paper is to select a small subset of informative genes
(minimize the number of selected genes) that yields high
cancer classification accuracy (maximize the classification
accuracy). To achieve this goal, we adopt 3-SGS and
evaluate it on three real gene expression data sets of tumor
samples.
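The idea behind stage 3 can be sketched in a few lines of Python. This introduction describes the stage only at a high level, so the selection rule below is an assumption: a gene is kept if it appears in at least min_count of the near-optimal subsets. The function name frequency_select, the parameter min_count, and the gene identifiers are hypothetical.

```python
from collections import Counter

def frequency_select(near_optimal_subsets, min_count):
    """Count how often each gene appears across the near-optimal
    subsets and keep those seen at least `min_count` times
    (assumed threshold rule, not taken from the paper)."""
    counts = Counter(
        gene for subset in near_optimal_subsets for gene in set(subset)
    )
    kept = [g for g, c in counts.items() if c >= min_count]
    # Most frequent genes first; ties broken by gene name.
    return sorted(kept, key=lambda g: (-counts[g], g))

# Hypothetical near-optimal subsets, e.g. from repeated stage-2 runs.
subsets = [["g1", "g7", "g9"], ["g1", "g9"], ["g1", "g4"]]
final_subset = frequency_select(subsets, min_count=2)
print(final_subset)  # ['g1', 'g9']
```

Genes that recur across many near-optimal subsets are less likely to have been selected by chance in a single run, which is one plausible reading of why a frequency analysis helps reduce over-fitting.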
The outline of this paper is as follows: Sections II and III
discuss previous works and the details of the proposed
3-SGS, respectively. In Section IV, gene expression data sets,
experimental setup, and experimental results are described.
The conclusion of this paper is provided in Section V.
II. PREVIOUS WORKS
Several hybrid methods, i.e., GASVM-based methods,
have been proposed for gene selection of gene expression
2010 International Conference on Intelligent Systems, Modelling and Simulation
978-0-7695-3973-7/10 $26.00 © 2010 IEEE
DOI 10.1109/ISMS.2010.39