Soft Comput (2006) 10: 346–350
DOI 10.1007/s00500-005-0493-9
FOCUS
Y. Zhao · C. K. Kwoh
Fast leave-one-out evaluation for dynamic gene selection
Published online: 27 April 2005
© Springer-Verlag 2005
Abstract Gene selection is a necessary step to increase the
accuracy of machine learning algorithms that help in disease
diagnosis based on gene expression data. It is commonly
known as a feature subset selection problem in the machine
learning domain. A fast leave-one-out (LOO) evaluation
formula for least-squares support vector machines (LS-SVMs)
is introduced here to guide our backward feature selection
process. Based on it, we propose a fast LOO guided feature
selection (LGFS) algorithm. The gene selection step size is
dynamically adjusted according to the LOO accuracy estimate.
In our experiments, applying LGFS to the gene selection
process improves classifier accuracy and reduces the number
of features required. The smallest number of genes that
maximizes the disease classification accuracy is determined
automatically by our algorithm.
1 Introduction
Gene expression data from DNA microarrays is a rich source
of information for scientists studying the simultaneous
activities of thousands of genes. One characteristic of gene
expression data is that the number of genes (often more than
1000) far exceeds the number of samples (often fewer than
100). Hence the identification of the important genes with
the most discriminative power is of considerable interest to
both biologists and medical professionals, as research can
then be concentrated on a small set of those genes.
In machine learning, it is a misconception that the more
features are included, the higher the classification accuracy
will be. It has been shown that a large proportion of the
genes in a microarray are not relevant to the discrimination
of disease types [1]. Irrelevant features not only place an
unnecessary burden on the computation but also degrade
classifier performance. Therefore feature selection comes
into play.

Y. Zhao (B) · C. K. Kwoh
Bioinformatics Research Centre (BIRC),
50 Nanyang Drive, Research Techno Plaza,
Singapore 637553
E-mail: zhaoying2000@163.com
Feature selection techniques can generally be classified
into two groups: the filter approach and the wrapper
approach. The key difference between them is that the
wrapper approach uses the learning algorithm itself as part
of the evaluation criterion while the filter approach does
not. In other words, the wrapper approach is algorithm
dependent while the filter approach is algorithm independent.
The filter approach usually incurs much less computational
cost than the wrapper approach. The price to pay is that the
predictive accuracy of the filter approach is generally lower
than that of the wrapper approach, as also reported in [2].
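To make the distinction concrete, the toy sketch below scores genes both ways on synthetic data: a filter score computed from each gene in isolation, and a wrapper score obtained by training an actual classifier on a candidate gene subset. The data, the mean-difference filter score, and the nearest-centroid classifier are all illustrative assumptions and not taken from this paper, which uses LS-SVMs as the wrapped learner.

```python
import statistics

# Toy expression data: 6 samples x 3 "genes"; gene 0 is informative,
# genes 1 and 2 are noise. (Data are illustrative only.)
X = [[5.0, 1.0, 2.0], [5.2, 3.0, 1.0], [4.8, 2.0, 3.0],
     [1.0, 2.5, 2.0], [1.2, 1.5, 1.0], [0.8, 2.0, 3.0]]
y = [1, 1, 1, 0, 0, 0]

def filter_score(j):
    """Filter: score gene j on its own (separation of class means),
    with no learning algorithm involved."""
    pos = [x[j] for x, lab in zip(X, y) if lab == 1]
    neg = [x[j] for x, lab in zip(X, y) if lab == 0]
    return abs(statistics.mean(pos) - statistics.mean(neg))

def wrapper_score(subset):
    """Wrapper: score a gene subset by the training accuracy of a
    classifier (a nearest-centroid stand-in for the LS-SVM) built on it."""
    def centroid(label):
        rows = [[x[j] for j in subset] for x, lab in zip(X, y) if lab == label]
        return [statistics.mean(col) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    preds = [int(dist([x[j] for j in subset], c1) <
                 dist([x[j] for j in subset], c0)) for x in X]
    return sum(p == lab for p, lab in zip(preds, y)) / len(y)

print(sorted(range(3), key=filter_score, reverse=True))  # gene 0 ranks first
print(wrapper_score([0]), wrapper_score([1, 2]))         # 1.0 0.5
```

The filter ranking is cheap (one pass per gene) but blind to the classifier; the wrapper score is classifier-specific, which is exactly why its cost estimate matters.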
Based on the above discussion, we adopt the wrapper approach
for our feature selection problem. Note that the LOO test, a
special case of k-fold cross validation, gives an almost
unbiased estimate of the accuracy [3]. This is an important
advantage of the LOO test. Another advantage is that the LOO
test is a stable estimate: with all algorithm parameters
fixed, it always gives the same accuracy value. In spite of
its almost unbiased accuracy estimation, the computational
cost of a LOO test is usually too high to afford. To perform
a LOO test, one sample is left out as test data while the
rest are used as the training dataset; then the next sample
is left out, and the process repeats until every sample has
been tested once. Thus n training runs (where n is the number
of samples in the dataset) are needed just to estimate the
accuracy for one set of parameters. The computational cost
soon becomes unaffordable as the sample size and the number
of parameter combinations grow. As a compromise, various
upper bounds on the LOO error have been formulated. Take
support vector machines (SVMs) for example: their LOO error
bounds include the xi-alpha bound [4], generalized approximate
cross validation [5], the span bound [6], the VC bound [7],
the radius-margin bound [6], etc. Despite the computational
savings these bounds offer, how accurately they reflect the
true LOO error remains a problem.
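The plain LOO protocol described above can be sketched as follows. The nearest-centroid "classifier" and the toy data are illustrative stand-ins (the paper works with LS-SVMs), but the structure makes the cost visible: one full retraining per sample.

```python
import statistics

# Toy dataset: 6 samples, 2 features, two well-separated classes.
X = [[2.0, 1.0], [2.5, 0.5], [3.0, 1.5],   # class 1
     [0.0, 2.0], [0.5, 2.5], [1.0, 3.0]]   # class 0
y = [1, 1, 1, 0, 0, 0]

def train(train_X, train_y):
    """'Training' here = computing one centroid per class."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, lab in zip(train_X, train_y) if lab == label]
        centroids[label] = [statistics.mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda lab: dist(x, centroids[lab]))

def loo_accuracy(X, y):
    """Leave each sample out in turn: n = len(X) separate training runs,
    which is exactly the cost the LOO bounds try to avoid."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = train(train_X, train_y)      # one full retraining per sample
        correct += predict(model, X[i]) == y[i]
    return correct / len(X)

print(loo_accuracy(X, y))  # 1.0 on this separable toy set
```

With a fixed dataset and fixed parameters this loop always returns the same value, which is the stability property noted above; repeating it for every candidate gene subset is what makes a fast LOO formula worthwhile.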