Computational Biology and Chemistry 29 (2005) 37–46

Gene selection from microarray data for cancer classification—a machine learning approach

Yu Wang a,∗, Igor V. Tetko a, Mark A. Hall b, Eibe Frank b, Axel Facius a, Klaus F.X. Mayer a, Hans W. Mewes a,c

a Institute for Bioinformatics, German Research Center for Environment and Health, Ingolstädter Landstraße 1, D-85764 Neuherberg, Germany
b Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton, New Zealand
c Department of Genome-Oriented Bioinformatics, Wissenschaftszentrum Weihenstephan, Technische Universität München, Alte Akademie 10, D-85354 Freising-Weihenstephan, Germany

Received 8 September 2004; received in revised form 18 November 2004; accepted 22 November 2004

Abstract

A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contains a small number of samples which have a large number of gene expression levels as features. To select relevant genes involved in different types of cancer remains a challenge. In order to extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naïve Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that a combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper which discusses both computational and biological evidence for the involvement of zyxin in leukaemogenesis.
© 2004 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compbiolchem.2004.11.001

Keywords: Microarray; Gene selection; Machine learning; Cancer classification; Feature selection

∗ Corresponding author. Tel.: +49 89 3187 2627; fax: +49 89 3187 3585. E-mail address: yu.wang@gsf.de (Y. Wang).

1. Introduction

Accurate cancer diagnosis is vital for the successful application of specific therapies. Although cancer classification has improved over the last decade, there is still a need for a fully automated and less subjective method for cancer diagnosis. Recent studies demonstrated that DNA microarrays could provide useful information for cancer classification at the gene expression level due to their ability to measure the abundance of messenger ribonucleic acid (mRNA) transcripts for thousands of genes simultaneously.

Several machine learning algorithms have already been applied to classifying tumors using microarray data. Voting machines and self-organising maps (SOM) were used to analyse acute leukemia (Golub et al., 1999). Support vector machines (SVMs) were applied to multi-class cancer diagnosis by Ramaswamy et al. (2001). Hierarchical clustering was used to analyse colon tumors (Alon et al., 1999). The best classification results are reported by Li et al. (2003) and Antonov et al. (2004). Li et al. employed a rule discovery method, and Antonov et al. used maximal margin linear programming (MAMA).

Given the nature of cancer microarray data, which usually consists of a few hundred samples with thousands of genes as features, the analysis has to be carried out carefully. Work in such a high-dimensional space is extremely difficult if not impossible. One straightforward approach to selecting relevant genes is the application of standard parametric tests such as the t-test (Thomas et al., 2001; Tsai et al., 2003) or non-parametric tests such as the Wilcoxon score test (Thomas et al., 2001; Antoniadis et al., 2003). Wilks’s Lambda score was