Computational Biology and Chemistry 29 (2005) 37–46
Gene selection from microarray data for cancer classification—a
machine learning approach
Yu Wang (a,∗), Igor V. Tetko (a), Mark A. Hall (b), Eibe Frank (b), Axel Facius (a), Klaus F.X. Mayer (a), Hans W. Mewes (a,c)
(a) Institute for Bioinformatics, German Research Center for Environment and Health, Ingolstädter Landstraße 1, D-85764 Neuherberg, Germany
(b) Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton, New Zealand
(c) Department of Genome-Oriented Bioinformatics, Wissenschaftszentrum Weihenstephan, Technische Universität München, Alte Akademie 10, D-85354 Freising-Weihenstephan, Germany
Received 8 September 2004; received in revised form 18 November 2004; accepted 22 November 2004
Abstract
A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this
technology can be useful in the classification of cancers. Cancer microarray data normally contains a small number of samples which have a
large number of gene expression levels as features. Selecting the relevant genes involved in different types of cancer remains a challenge. To
extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically
investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naïve
Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute
leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that a combined use of different classification and
feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper that discusses both
computational and biological evidence for the involvement of zyxin in leukaemogenesis.
© 2004 Elsevier Ltd. All rights reserved.
Keywords: Microarray; Gene selection; Machine learning; Cancer classification; Feature selection
1. Introduction
Accurate cancer diagnosis is vital for the successful application of specific therapies. Although cancer classification has improved over the last decade, there is still a need for a fully automated and less subjective method for cancer diagnosis. Recent studies demonstrated that DNA microarrays could provide useful information for cancer classification at the gene expression level due to their ability to measure the abundance of messenger ribonucleic acid (mRNA) transcripts for thousands of genes simultaneously.
Several machine learning algorithms have already been applied to classifying tumors using microarray data. Voting machines and self-organising maps (SOMs) were used to analyse acute leukemia (Golub et al., 1999). Support vector machines (SVMs) were applied to multi-class cancer diagnosis by Ramaswamy et al. (2001). Hierarchical clustering was used to analyse colon tumors (Alon et al., 1999). The best classification results are reported by Li et al. (2003) and Antonov et al. (2004). Li et al. employed a rule discovery method, and Antonov et al. used maximal margin linear programming (MAMA).
∗ Corresponding author. Tel.: +49 89 3187 2627; fax: +49 89 3187 3585.
E-mail address: yu.wang@gsf.de (Y. Wang).
Given the nature of cancer microarray data, which usually consists of a few hundred samples with thousands of genes as features, the analysis has to be carried out carefully. Working in such a high-dimensional space is extremely difficult, if not impossible. One straightforward approach to selecting relevant genes is to apply standard parametric tests such as the t-test (Thomas et al., 2001; Tsai et al., 2003) or non-parametric tests such as the Wilcoxon score test (Thomas et al., 2001; Antoniadis et al., 2003). Wilks’s Lambda score was
1476-9271/$ – see front matter © 2004 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compbiolchem.2004.11.001