Ranking and Selection of Differentially Expressed Genes from Microarray Data

J. SHAIK and M. YEASIN
Computer Vision Pattern and Image Analysis (CVPIA) Laboratory
Electrical and Computer Engineering
University of Memphis, Memphis, TN 38152, USA
jshaik@memphis.edu, myeasin@memphis.edu
http://www.eece.memphis.edu/people/faculty/myeasin/CVPIA/

Abstract: - This paper presents adaptive algorithms for ranking and selecting differentially expressed genes from microarray data. A ranking method originally proposed in [1] is adapted and supplemented with a Hausdorff distance-based ranking method to improve the performance of the ranking algorithm. A weighted fusion scheme is developed to fuse the ‘mean’ and the Hausdorff distance-based ranking methods into a robust ranking method. The normalized consistency measure is used as the weight for the fusion of the ranking methods. An adaptive subspace iteration (ASI) based selection algorithm is then applied to the top-ranked genes to select highly differentially expressed genes. To illustrate the utility of the proposed algorithms, a number of empirical analyses were conducted on both simulated data (400 simulated microarray datasets) and real microarray datasets (a colon cancer dataset and a gastric cancer dataset). The empirical analysis shows that the proposed unified approach is robust against initialization and yields consistent selection of differentially expressed genes.

Key-Words: - Adaptive Subspace Iteration, Clustering, Ranking, Differentially Expressed Genes, Microarray Data Analysis.

1 Introduction

Real microarray data sets have a large number of variables (on the order of 10^2 to 10^4) and a small number of samples/experimental conditions (on the order of 10^1 to 10^2).
Several problems arise in analyzing microarray data, including (but not limited to): (i) small sample size compared to the number of features; (ii) relative importance of individual samples; (iii) inadequate understanding of the underlying model distribution; (iv) experimental noise; (v) lack of ground truth information; and (vi) redundancy among the highly ranked genes. Several algorithms (e.g., based on statistics [2-8], information theory [9-15], or functions of classifier outputs [4]) have been reported for ranking microarray data. The key problems with most of the reported algorithms include (but are not limited to): (i) sensitivity to initialization; (ii) lack of adaptivity in ranking and selection of differentially expressed genes; and (iii) absence of methodologies for evaluating the computed results.

To solve some of the above-mentioned problems, this paper presents a unified framework for finding differentially expressed genes using adaptive ranking and selection algorithms. The ranking algorithm originally proposed in [1] is adapted and supplemented with a Hausdorff distance-based ranking method to improve the performance of the ranking algorithm. A weighted fusion scheme is developed to fuse the ‘mean’ and the Hausdorff distance-based ranking methods into a robust ranking method. The normalized consistency measure (cf. equation 2) is used as the weight for the fusion of the ranking methods. An adaptive subspace iteration (ASI) based selection algorithm is then applied to the top-ranked genes to select highly differentially expressed genes [8, 16]. The computed results were validated using the silhouette index of the clusters.

The problems associated with the ‘mean’ method can be alleviated by using the Hausdorff distance measure, which works with unequal numbers of samples in the two cases being compared and does not require random selection of samples. The Hausdorff distance may itself be influenced by outlier sample(s). This problem can be addressed by using the K-th Hausdorff distance.
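As a rough illustration of the Hausdorff-based idea described above, the following Python sketch scores each gene by a K-th ranked Hausdorff distance between its expression values in two sample groups. It is a sketch under stated assumptions, not the authors' implementation: the function names, the use of the absolute difference between scalar expression values, and the convention that k = 1 gives the classical Hausdorff distance while a larger k discards the k - 1 most extreme samples are all illustrative choices that do not come from the paper.

```python
import numpy as np

def directed_kth_hausdorff(a, b, k=1):
    """k-th largest nearest-neighbour distance from samples in a to samples in b.

    Assumed convention (not from the paper): k=1 is the classical directed
    Hausdorff distance; k must not exceed the number of samples in a.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # |a_i - b_j| for every pair, then the distance to the nearest b for each a_i
    nearest = np.abs(a[:, None] - b[None, :]).min(axis=1)
    # Sort in descending order and pick the k-th largest value
    return np.sort(nearest)[::-1][k - 1]

def kth_hausdorff(a, b, k=1):
    """Symmetric K-th Hausdorff distance between two sets of expression values."""
    return max(directed_kth_hausdorff(a, b, k),
               directed_kth_hausdorff(b, a, k))

def rank_genes(class_a, class_b, k=1):
    """Rank genes by K-th Hausdorff distance, most different first.

    class_a: array of shape (genes, n_a); class_b: array of shape (genes, n_b).
    n_a and n_b may differ. Returns (gene indices in rank order, scores).
    """
    scores = np.array([kth_hausdorff(x, y, k)
                       for x, y in zip(class_a, class_b)])
    return np.argsort(-scores), scores

# Hypothetical toy data: 2 genes, 3 samples in one group and 2 in the other
group_a = np.array([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]])
group_b = np.array([[1.0, 2.0], [9.0, 10.0]])
order, scores = rank_genes(group_a, group_b)
```

In this toy example the two groups have different numbers of samples, and no single summary statistic is computed per group. With `k=2`, a single outlying sample such as the value 10 in `kth_hausdorff([1, 2, 3], [2, 3, 10], k=2)` no longer dominates the distance.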
Moreover, with the Hausdorff distance the samples themselves are involved in computing the difference of expression, rather than a single statistic representing all the samples as in the ‘mean’ method. To improve robustness, the mean and Hausdorff distance-based methods are fused using the consistency measure. Selection and validation of differentially expressed genes are performed by applying the ASI algorithm to a fixed number of top-ranked genes. It is hypothesized that if the top-ranked genes fall into the same cluster, they may be highly differentially expressed. This assumption may not always hold. The solution to this problem can be found in ASI

Proceedings of the 2006 WSEAS International Conference on Mathematical Biology and Ecology, Miami, Florida, USA, January 18-20, 2006 (pp140-145)