Wavelet selection for disease classification by DNA microarray data

Loris Nanni *, Alessandra Lumini

DEIS, IEIIT – CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy

Keywords: Microarray data; 1-D wavelet transform; Support vector machine; Fusion of classifiers

Abstract

Microarrays report the expression levels of tens of thousands of genes. This high-dimensional feature vector also contains information that is irrelevant for accurate classification. Moreover, only a few training samples are available, so feature reduction should be performed before the classification step to avoid the curse of dimensionality. Here, we propose using sets of orthogonal wavelet detail coefficients from different mother wavelets to extract features from the microarray data. We propose a multi-classifier in which each classifier, a support vector machine, is trained on a different set of detail coefficients; the classifiers are combined by the "sum rule". The sets of detail coefficients are selected by running Sequential Forward Floating Selection (SFFS).

The proposed method is validated using the area under the ROC curve as the performance indicator. The experiments are carried out on four datasets: Breast, Ovarian, Lung, and Prostate. The results show that the proposed method outperforms what can be obtained with a single set of detail coefficients. Moreover, we show that, even when the detail coefficients are used as features, a random subspace of classifiers outperforms the stand-alone classifiers.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Gene activity is very useful since it can be used for several applications in biological and biomedical studies (Margalit, Somech, Amariglio, & Rechavi, 2005), e.g.
to distinguish between malignant pleural mesothelioma and adenocarcinoma of the lung, or to identify patients who might benefit from adjuvant chemotherapy. Note that even though all human cells contain the same genetic material, gene activity may vary among cells. A well-known technology for simultaneously measuring the activity of tens of thousands of genes is the DNA microarray.

From the machine learning point of view, the output of the DNA microarray is used for disease classification (e.g. heart disease, tumor recognition) starting from the analysis of a given tissue sample (Khan et al., 2001; Lee, Rodriguez, & Madabhushi, 2008; Moon et al., 2007). Several examples have already been reported in the literature, mainly on cancer classification (Alon et al., 1999; Ramaswamy et al., 2001; Rao et al., 2001) or on the effectiveness of different therapies (Golub et al., 1999; Tibshirani, Hastie, Narasimhan, & Chu, 2002).

In recent years, as in several other fields of machine learning, the trend has been to study ensembles of classifiers to improve on the performance obtained by a stand-alone method. Examples of works where ensembles of classifiers are tested for classifying microarray data are Moon et al. (2007), Nanni and Lumini (2007a), and Tan and Gilbert (2003). The most important published ensemble methods for classifying microarray data are reported in Table 1 (extracted from Nanni and Lumini (2009)). As in Nanni and Lumini (2009), three categories are used for the analysis of different algorithms for multi-classifier combination:

• Perturbation of the patterns (A): each classifier is trained using a different training set or different weights for the patterns (e.g. Bagging or Boosting).
• Perturbation of the features (B): each classifier is trained using a different feature set (e.g.
Random Forest or Random Subspace).
• Perturbation of the classifiers (C): each classifier has different values for its parameters, or different classifiers are combined.

One of the most widely used methods is boosting, which is particularly interesting for microarray data classification since it naturally performs feature selection (Ben-Dor et al., 2000; Dudoit & Fridlyand, 2003). We stress that microarray classification deals with data made up of several thousand features, while only a few dozen patterns are available for each class; this may negatively affect the performance of a classifier (the curse of dimensionality) (Jain, Duin, & Mao, 2000). Moreover, in Hua and Lai (2007) a boosting method is used

doi:10.1016/j.eswa.2010.07.104
* Corresponding author. E-mail address: lnanni@deis.unibo.it (L. Nanni).
Expert Systems with Applications 38 (2011) 990–995