Expert Systems with Applications 38 (2011) 990–995
doi:10.1016/j.eswa.2010.07.104

Wavelet selection for disease classification by DNA microarray data

Loris Nanni *, Alessandra Lumini
DEIS, IEIIT – CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy
* Corresponding author. E-mail address: lnanni@deis.unibo.it (L. Nanni).

Keywords: Microarray data; 1-D wavelet transform; Support vector machine; Fusion of classifiers

Abstract

Microarrays measure the expression levels of tens of thousands of genes; the resulting high-dimensional feature vector also contains information that is irrelevant for accurate classification. Moreover, only a few training samples are available, so to avoid the curse of dimensionality a feature reduction should be performed before the classification step. Here, we propose to extract the features from the microarray data using sets of orthogonal wavelet detail coefficients obtained from different mother wavelets. We propose a multi-classifier system in which each classifier, a support vector machine, is trained using a different set of detail coefficients; the classifiers are combined by the "sum rule". The sets of detail coefficients are selected by running Sequential Forward Floating Selection (SFFS).

The proposed method is validated using the area under the ROC curve as the performance indicator. The experiments are carried out on four datasets: the Breast dataset, the Ovarian dataset, the Lung dataset, and the Prostate dataset. The results show that the proposed method outperforms what can be obtained with a single set of detail coefficients. Moreover, we show that, also when the detail coefficients are used as features, a random subspace of classifiers outperforms the stand-alone classifiers.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Gene activity is useful for several applications in biological and biomedical studies (Margalit, Somech, Amariglio, & Rechavi, 2005), e.g. to distinguish between malignant pleural mesothelioma and adenocarcinoma of the lung, or to identify patients who might benefit from adjuvant chemotherapy. Note that although all human cells contain the same genetic material, the activity of genes may vary among cells. A well-known technology for simultaneously measuring the activities of tens of thousands of genes is the DNA microarray.

From the machine learning point of view, the output of the DNA microarray is used for disease classification (e.g. heart disease, tumor recognition) starting from the analysis of a given tissue sample (Khan et al., 2001; Lee, Rodriguez, & Madabhushi, 2008; Moon et al., 2007). Several examples have already been reported in the literature, mainly on cancer classification (Alon et al., 1999; Ramaswamy et al., 2001; Rao et al., 2001) or on the effectiveness of different therapies (Golub et al., 1999; Tibshirani, Hastie, Narasimhan, & Chu, 2002).

In recent years, as in several other fields of machine learning, the trend has been to study ensembles of classifiers in order to improve on the performance obtained by a stand-alone method. Examples of works where ensembles of classifiers are tested for classifying microarray data are Moon et al. (2007), Nanni and Lumini (2007a), and Tan and Gilbert (2003). The most important published ensemble methods for classifying microarray data are reported in Table 1 (extracted from Nanni and Lumini (2009)). As in Nanni and Lumini (2009), three categories are used for the analysis of the different algorithms for multi-classifier combination:

Perturbation of the patterns (A): each classifier is trained using a different training set or different weights for the patterns (e.g. Bagging or Boosting).
Perturbation of the features (B): each classifier is trained using a different feature set (e.g. Random Forest or Random Subspace).
Perturbation of the classifiers (C): each classifier has different values for its parameters, or different classifiers are combined.

One of the most widely used methods is boosting. This method is particularly interesting for microarray data classification since it naturally performs a feature selection (Ben-Dor et al., 2000; Dudoit & Fridlyand, 2003). We want to stress that microarray classification operates on data made up of several thousand features, while only a few dozen patterns are available for each class; this may negatively affect the performance of a classifier (curse of dimensionality) (Jain, Duin, & Mao, 2000). Moreover, in Hua and Lai (2007) a boosting method is used