Chapter 27
Feature Selection Algorithms for Mining High Dimensional DNA Microarray Data

David J. Dittman, Taghi M. Khoshgoftaar, Randall Wald, and Jason Van Hulse

1 Introduction

The World Health Organization identified cancer as the second largest contributor to death worldwide, surpassed only by cardiovascular disease. The death count for cancer in 2002 was 7.1 million and is expected to rise to 11.5 million annually by 2030 [17]. In 2009, the International Conference on Machine Learning and Applications (ICMLA) proposed a challenge regarding gene expression profiles in human cancers. The goal of the challenge was the “identification of functional clusters of genes from gene expression profiles in three major cancers: breast, colon and lung.” The identification of these clusters may further our understanding of cancer and open up new avenues of research.

One of the main goals of data mining is to classify instances given specific information. Classification has many important applications, ranging from finding problem areas in a computer program’s code to predicting whether a person is likely to have a specific disease. However, one of the biggest obstacles to proper classification is high dimensional data (data in which each instance has a large number of features). A very useful tool for working with high dimensional data is feature selection, the process of choosing a subset of features and analyzing only those features. Only the selected features are used for building models; the rest are discarded. Although it discards potentially useful data, feature selection can lead to more efficient and accurate classifiers [24].

DNA microarray data is one type of data for which feature selection is essential. The creation of the DNA microarray was a recent technological and chemical advance in the field of genetic research.
D.J. Dittman • T.M. Khoshgoftaar • R. Wald • J.V. Hulse
FAU, Boca Raton, FL
e-mail: dittmandj@gmail.com; khoshgof@fau.edu; rwald1@fau.edu; jvanhulse@gmail.com

B. Furht and A. Escalante (eds.), Handbook of Data Intensive Computing, DOI 10.1007/978-1-4614-1415-5_27, © Springer Science+Business Media, LLC 2011

To take advantage of the fact that messenger RNA (mRNA), the blueprints that encode all of the proteins made within a given cell, will readily bind to complementary DNA (cDNA), the