Chapter 27
Feature Selection Algorithms for Mining High
Dimensional DNA Microarray Data
David J. Dittman, Taghi M. Khoshgoftaar, Randall Wald,
and Jason Van Hulse
1 Introduction
The World Health Organization identified cancer as the second largest contributor
to death worldwide, surpassed only by cardiovascular disease. Cancer caused
7.1 million deaths in 2002, and that figure is expected to rise to 11.5 million annually
by 2030 [17]. In 2009, the International Conference on Machine Learning and
Applications (ICMLA) proposed a challenge regarding gene expression profiles
in human cancers. The goal of the challenge was the “identification of functional
clusters of genes from gene expression profiles in three major cancers: breast, colon
and lung.” The identification of these clusters may further our understanding of
cancer and open up new avenues of research.
One of the main goals of data mining is to classify instances given specific
information. Classification has many important applications, ranging from finding
problem areas in a computer program’s code to predicting whether a person is likely to
have a specific disease. However, one of the biggest obstacles to accurate classification
is high dimensional data (data in which each instance has a large number of
features). A very useful tool for working with high dimensional data is feature
selection, which is the process of choosing a subset of features and analyzing only
those features. Only the selected features will be used for building models; the rest
are discarded. Even though some potentially useful data is eliminated, feature selection can lead to
more efficient and accurate classifiers [24].
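As a rough illustration of the process described above, the following Python sketch scores each feature, keeps the highest-scoring subset, and discards the rest before any model is built. The signal-to-noise score, the function names (signal_to_noise, select_top_k, project), and the toy data are assumptions chosen for illustration only; they are not taken from this chapter, and many other ranking criteria could be substituted.

```python
from statistics import mean, pstdev


def signal_to_noise(values, labels):
    """Score one feature: |mean_pos - mean_neg| / (std_pos + std_neg).

    Assumed scoring rule for illustration; labels are 1 (positive) or 0 (negative).
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = pstdev(pos) + pstdev(neg)
    gap = abs(mean(pos) - mean(neg))
    if spread == 0:
        # No within-class variation: any nonzero gap separates the classes.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def select_top_k(X, y, k):
    """Return the (sorted) column indices of the k highest-scoring features."""
    n_features = len(X[0])
    scores = [signal_to_noise([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def project(X, selected):
    """Keep only the selected columns; all other features are discarded."""
    return [[row[j] for j in selected] for row in X]


# Hypothetical toy data: 6 instances, 4 features.
# Features 0 and 2 track the class label; features 1 and 3 are noise.
X = [
    [5.1, 0.3, 9.8, 1.0],
    [4.9, 0.9, 9.5, 0.2],
    [5.3, 0.1, 9.9, 0.7],
    [1.0, 0.8, 2.1, 0.9],
    [1.2, 0.2, 2.4, 0.3],
    [0.8, 0.5, 1.9, 0.6],
]
y = [1, 1, 1, 0, 0, 0]

selected = select_top_k(X, y, k=2)
X_reduced = project(X, selected)  # this reduced matrix is what a classifier would see
```

In this toy case the two informative columns are retained and the two noise columns are dropped, so a downstream classifier works with half as many features. With real microarray data the same pattern would apply to thousands of gene-expression features rather than four.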
DNA microarray data is one type of data for which feature selection is essential.
The DNA microarray is a recent technological
and chemical advance in the field of genetic research. To take advantage of the
fact that messenger RNA (mRNA), the blueprints that encode all of the proteins
made within a given cell, will readily bind to complementary DNA (cDNA), the
D.J. Dittman • T.M. Khoshgoftaar • R. Wald • J.V. Hulse
FAU, Boca Raton, FL
e-mail: dittmandj@gmail.com; khoshgof@fau.edu; rwald1@fau.edu; jvanhulse@gmail.com
B. Furht and A. Escalante (eds.), Handbook of Data Intensive Computing,
DOI 10.1007/978-1-4614-1415-5_27, © Springer Science+Business Media, LLC 2011