Chapter 27
Feature Selection Algorithms for Mining High Dimensional DNA Microarray Data

David J. Dittman, Taghi M. Khoshgoftaar, Randall Wald, and Jason Van Hulse

1 Introduction

The World Health Organization identified cancer as the second largest contributor to death worldwide, surpassed only by cardiovascular disease. The death count for cancer in 2002 was 7.1 million and is expected to rise to 11.5 million annually by 2030 [17]. In 2009, the International Conference on Machine Learning and Applications (ICMLA) proposed a challenge regarding gene expression profiles in human cancers. The goal of the challenge was the “identification of functional clusters of genes from gene expression profiles in three major cancers: breast, colon and lung.” The identification of these clusters may further our understanding of cancer and open up new avenues of research.

One of the main goals of data mining is to classify instances given specific information. Classification has many important applications, ranging from finding problem areas in a computer program’s code to predicting whether a person is likely to have a specific disease. However, one of the biggest obstacles to proper classification is high dimensional data, that is, data in which each instance has a large number of features. A very useful tool for working with high dimensional data is feature selection, which is the process of choosing a subset of features and analyzing only those features. Only the selected features are used for building models; the rest are discarded. Although some potentially useful data is eliminated, feature selection can lead to more efficient and more accurate classifiers [24].

An example of a type of data which absolutely needs feature selection is DNA microarray data. The creation of the DNA microarray was a recent technological and chemical advance in the field of genetic research.

D.J. Dittman • T.M. Khoshgoftaar • R. Wald • J.V. Hulse
FAU, Boca Raton, FL
e-mail: dittmandj@gmail.com; khoshgof@fau.edu; rwald1@fau.edu; jvanhulse@gmail.com

B. Furht and A. Escalante (eds.), Handbook of Data Intensive Computing, DOI 10.1007/978-1-4614-1415-5_27, © Springer Science+Business Media, LLC 2011

To take advantage of the fact that messenger RNA (mRNA), the blueprints that encode all of the proteins made within a given cell, will readily bind to complementary DNA (cDNA), the