M. Chetty, A. Ngom, and S. Ahmad (Eds.): PRIB 2008, LNBI 5265, pp. 121–131, 2008. © Springer-Verlag Berlin Heidelberg 2008

Feature Selection and Classification for Small Gene Sets

Gregor Stiglic 1,2, Juan J. Rodriguez 3, and Peter Kokol 1,2

1 University of Maribor, Faculty of Health Sciences, Zitna ulica 15, 2000 Maribor, Slovenia
2 University of Maribor, Faculty of Electrical Engineering and Computer Science, Smetanova 17, 2000 Maribor, Slovenia
{gregor.stiglic,kokol}@uni-mb.si
3 University of Burgos, c/ Francisco de Vitoria s/n, 09006 Burgos, Spain
jjrodriguez@ubu.es

Abstract. Random Forests, Support Vector Machines and k-Nearest Neighbors are successful and proven classification techniques that are widely used for different kinds of classification problems. One such problem is the classification of genomic and proteomic data, which is known for its extremely high dimensionality and therefore demands suitable classification techniques. In this domain they are usually combined with gene selection techniques to provide optimal classification accuracy rates. Another reason for reducing the dimensionality of such datasets is their interpretability. It is much easier to interpret a small set of ranked genes than 20 or 30 thousand unordered genes. In this paper we present a classification ensemble of decision trees called Rotation Forest and evaluate its classification performance on small subsets of ranked genes for 14 genomic and proteomic classification problems. An important feature of Rotation Forest is demonstrated: robustness and high classification accuracy when using small sets of genes.

Keywords: Gene expression analysis, machine learning, feature selection, ensemble of classifiers.

1 Introduction

There are many new classification methods and variants of existing techniques for classification problems. One of them is the Random Forests classifier, which was presented in [1] by Breiman and Cutler.
It has proven to be a fast, robust and very accurate technique that can be compared with the best classifiers (e.g. Support Vector Machines [2] or some of the most efficient ensemble-based classification techniques) [3]. Most of these techniques are also used in genomic and proteomic classification problems, where classifiers need to be specialized for high-dimensional problems. The other option is to integrate feature pre-selection into the classification process, so that the initial feature set is reduced before classification is done. Most of the early experiments on microarray gene expression datasets used simple statistical methods of gene ranking to reduce the initial set of attributes. Recently, more advanced feature selection methods from the machine learning field have been applied to the pre-selection step in genomic and proteomic classification problems. Although a small number of genes is