Soft Comput (2006) 10: 346–350
DOI 10.1007/s00500-005-0493-9

FOCUS

Y. Zhao · C. K. Kwoh

Fast leave-one-out evaluation for dynamic gene selection

Published online: 27 April 2005
© Springer-Verlag 2005

Y. Zhao (B) · C. K. Kwoh
Bioinformatics Research Centre (BIRC), 50 Nanyang Drive, Research Techno Plaza, Singapore 637553
E-mail: zhaoying2000@163.com

Abstract  A gene selection procedure is a necessary step in increasing the accuracy of machine learning algorithms that assist disease diagnosis based on gene expression data. This is commonly known as a feature subset selection problem in the machine learning domain. We introduce a fast leave-one-out (LOO) evaluation formula for least-squares support vector machines (LS-SVMs) that can guide a backward feature selection process. Based on it, we propose a fast LOO-guided feature selection (LGFS) algorithm, in which the gene selection step size is dynamically adjusted according to the LOO accuracy estimate. In our experiments, applying LGFS to the gene selection process improves classifier accuracy and reduces the number of features required. The smallest number of genes that maximizes the disease classification accuracy is determined automatically by our algorithm.

1 Introduction

Gene expression data from DNA microarrays is a rich source of information for scientists studying the simultaneous activities of thousands of genes. One characteristic of gene expression data is that the number of genes (often more than 1000) far exceeds the number of samples (often fewer than 100). Hence the identification of the important genes with the most discriminative power is of considerable interest to both biologists and medical professionals, as it allows research effort to be narrowed to a small set of those important genes.

In machine learning, it is a misconception that the more features are included, the higher the classification accuracy will be. It has been shown that a large proportion of the genes in a microarray are not relevant to the discrimination of disease types [1]. Irrelevant features not only place an unnecessary burden on the computation but also degrade classifier performance. Therefore feature selection comes into play.

Feature selection techniques can be broadly classified into two groups: the filter approach and the wrapper approach. The key difference between them is that the wrapper approach uses the learning algorithm itself as part of the evaluation criterion while the filter approach does not. In other words, the wrapper approach is algorithm dependent while the filter approach is algorithm independent. The filter approach usually incurs much less computational cost than the wrapper approach; the price paid is that its predictive accuracy is generally lower than that of the wrapper approach. This was also reported in [2].

For these reasons, we adopt the wrapper approach for our feature selection problem. The LOO test, a special case of k-fold cross-validation, gives an almost unbiased estimate of the accuracy [3]; this is an important advantage of the LOO test. Another advantage is that the LOO estimate is stable: with all algorithm parameters fixed, the LOO test always yields the same accuracy value. Despite its almost unbiased accuracy estimation, however, the computational cost of a LOO test is usually too high to afford. To perform a LOO test, one sample is left out as test data and the rest are used as the training set; then the next sample is left out, and the process repeats until every sample has been tested once. Thus n (the number of samples in the dataset) training runs are needed just to estimate the accuracy for a single parameter setting.
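The naive LOO procedure just described can be sketched as follows. This is only a minimal illustration of the n-training-runs cost, with a toy nearest-centroid classifier standing in for the learner; the function names and the synthetic data are our own, not part of the paper.

```python
import numpy as np

def nearest_centroid(X_train, y_train, x_test):
    """Toy stand-in classifier: predict the class with the closest mean."""
    classes = np.unique(y_train)
    dists = [np.linalg.norm(x_test - X_train[y_train == c].mean(axis=0))
             for c in classes]
    return classes[int(np.argmin(dists))]

def loo_accuracy(X, y, train_and_predict):
    """Naive leave-one-out test: one training run per sample, n in total."""
    n = len(y)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i                    # leave sample i out
        pred = train_and_predict(X[mask], y[mask], X[i])
        correct += int(pred == y[i])
    return correct / n

# Two well-separated synthetic classes: the LOO test should score perfectly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (10, 5)), rng.normal(3.0, 0.1, (10, 5))])
y = np.array([0] * 10 + [1] * 10)
acc = loo_accuracy(X, y, nearest_centroid)
```

Note that every candidate parameter setting (or feature subset) evaluated this way multiplies the number of training runs by n, which is precisely the cost that motivates the fast LOO formula developed in this paper.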
The computational cost soon becomes unaffordable as the sample size and the number of parameter combinations increase. As a compromise, various upper bounds on the LOO error have been formulated. Take support vector machines (SVMs) as an example: their LOO error bounds include the xi-alpha bound [4], generalized approximate cross-validation [5], the span bound [6], the VC bound [7], and the radius–margin bound [6]. Despite the computational savings these bounds bring, how accurately they reflect the true LOO error remains a problem.
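For contrast with such analytic bounds, a generic wrapper built on the exact (naive) LOO test can be sketched as below: greedy backward elimination that repeatedly drops the feature whose removal best preserves the LOO accuracy. This is only an illustration of the wrapper idea, using a 1-nearest-neighbour stand-in classifier; it is not the LGFS algorithm of this paper, which relies on a fast LOO formula for LS-SVMs and a dynamically adjusted step size.

```python
import numpy as np

def loo_accuracy(X, y):
    """Exact LOO accuracy of a 1-nearest-neighbour stand-in classifier."""
    n = len(y)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i                    # leave sample i out
        d = np.linalg.norm(X[mask] - X[i], axis=1)
        correct += int(y[mask][np.argmin(d)] == y[i])
    return correct / n

def backward_select(X, y):
    """Greedy backward elimination guided by the exact LOO accuracy.

    Drop the feature whose removal keeps the LOO accuracy highest;
    stop as soon as every possible removal hurts the accuracy.
    """
    features = list(range(X.shape[1]))
    best_acc = loo_accuracy(X, y)
    while len(features) > 1:
        trials = [(loo_accuracy(X[:, [f for f in features if f != g]], y), g)
                  for g in features]
        acc, victim = max(trials)
        if acc < best_acc:
            break                                   # any removal now hurts
        best_acc = acc
        features = [f for f in features if f != victim]
    return features, best_acc

# Synthetic data: feature 0 separates the classes, the rest are noise.
rng = np.random.default_rng(1)
y = np.array([0] * 15 + [1] * 15)
informative = np.vstack([rng.normal(0, 1, (15, 1)), rng.normal(4, 1, (15, 1))])
X = np.hstack([informative, rng.normal(0, 1, (30, 4))])
selected, acc = backward_select(X, y)
```

With the naive LOO test inside the loop, each elimination round costs on the order of n times the current number of features in training runs, which is exactly the expense a fast LOO evaluation formula is designed to avoid.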