Gene Expression Data Analysis Using a Novel Approach to Biclustering Combining Discrete and Continuous Data

Yann Christinat, Bernd Wachmann, and Lei Zhang

Abstract—Many different methods exist for pattern detection in gene expression data. In contrast to classical methods, biclustering can cluster a group of genes together with a group of conditions (replicates, sets of patients, or drug compounds). However, since the problem is NP-hard, most algorithms use heuristic search functions and, therefore, might converge toward local maxima. By using the results of biclustering on discrete data as a starting point for a local search function on continuous data, our algorithm avoids the problem of heuristic initialization. Similar to Order-Preserving Submatrices (OPSM), our algorithm aims to detect biclusters whose rows and columns can be ordered such that the values in each row increase across the bicluster's columns and vice versa. Results have been generated on the yeast genome (Saccharomyces cerevisiae), a human cancer data set, and random data. Results on the yeast genome showed that 89 percent of the 100 biggest nonoverlapping biclusters were enriched with Gene Ontology annotations. A comparison with OPSM and the Iterative Signature Algorithm (ISA, a generalization of singular value decomposition) demonstrated better efficiency when using gene and condition orders. We present results on random and real data sets that show the ability of our algorithm to capture statistically significant and biologically relevant biclusters.

Index Terms—Data mining, biclustering algorithm, gene expression data, discrete data, simultaneous clustering, microarray analysis.

1 INTRODUCTION

DNA microarrays produce large quantities of data that serve different purposes: drug development and testing, gene process and function annotation, or cancer diagnosis.
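The bicluster model stated in the abstract (rows and columns that can be ordered so that values grow along both) can be made concrete with a small check. The sketch below is only an illustration of that property, not the algorithm of this paper: the function names, the toy matrix `expr`, and the reading of "growing" as strictly increasing are all assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def is_strictly_increasing(values: Sequence[float]) -> bool:
    """True if every value is smaller than its successor."""
    return all(a < b for a, b in zip(values, values[1:]))


def order_preserving_orders(
    matrix: Matrix, rows: Sequence[int], cols: Sequence[int]
) -> Optional[Tuple[List[int], List[int]]]:
    """Hypothetical helper: return (row_order, col_order) if the submatrix
    given by `rows` and `cols` can be ordered so that values strictly
    increase along every row and every column; otherwise return None."""
    if not rows or not cols:
        return None
    # If valid orderings exist, every row ranks the columns identically,
    # so sorting the columns by the first selected row recovers that order.
    col_order = sorted(cols, key=lambda c: matrix[rows[0]][c])
    # The same argument applies to the rows, using the first selected column.
    row_order = sorted(rows, key=lambda r: matrix[r][cols[0]])
    # Verify the candidate orderings on the whole submatrix.
    for r in row_order:
        if not is_strictly_increasing([matrix[r][c] for c in col_order]):
            return None
    for c in col_order:
        if not is_strictly_increasing([matrix[r][c] for r in row_order]):
            return None
    return row_order, col_order


# Toy expression matrix (made-up values): genes as rows, conditions as columns.
expr = [
    [0.9, 0.1, 0.5, 0.3],
    [1.8, 0.4, 1.1, 0.2],
    [0.2, 0.7, 0.6, 0.9],
    [2.5, 0.6, 1.9, 0.1],
]

print(order_preserving_orders(expr, [0, 1, 3], [0, 1, 2]))
print(order_preserving_orders(expr, [0, 1, 2, 3], [0, 1, 2]))
```

Under strict monotonicity, a single row fixes the column order and a single column fixes the row order, so the check needs two sorts and one pass over the submatrix. Ties or noisy measurements would require a tolerance, which this sketch does not attempt.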
Although those goals differ significantly, they all rely on pattern discovery for gene expression data analysis and therefore require accurate and specific clustering algorithms. Classical practice uses 1D clustering algorithms to create groups of genes or conditions. However, some regulatory mechanisms occur only in a subset of conditions and genes, and detecting those networks can be very difficult for such algorithms.

Biclustering is a novel clustering technique that aims to detect a group of correlated genes with respect to a group of conditions (such as time-series experiments, replicates, population samples, or drug compounds, among others). First applied to gene expression data by Cheng and Church in 2000 [1], it has grown in popularity, and many different algorithms have been developed since.

According to Madeira and Oliveira [2], biclustering algorithms can be classified with respect to their bicluster type and their algorithm class. Whereas some algorithms look for biclusters with constant values (such as ISA [3], [4], CTWC [5], and spectral biclustering [6]) or constant row or column values (xMotif [7]), others look for biclusters with coherent values (Cheng and Church [1], FLOC [8], and plaid models [9]) or coherent evolution (OPSM [10] and SAMBA [11]). Tagkopoulos et al. also clustered genes and conditions according to different classes of gene expression mechanisms [12].

Note that the algorithm class greatly influences the results, and although heuristic hill-climbing algorithms seem very popular, other techniques have been used. ISA and CTWC rely on alternating between rows and columns, spectral biclustering is based on linear algebra, and SAMBA combines a graph hashing technique with a local search. Recently, some algorithms have used metaheuristics such as evolutionary algorithms and simulated annealing. However, they outperformed Cheng and Church only marginally, if at all [13], [14]. Prelic et al.
applied a simple divide-and-conquer biclustering algorithm to discrete data and demonstrated promising results [15].

One of the most troublesome drawbacks of hill-climbing algorithms is their sensitivity to local maxima. Therefore, the initialization of the algorithm plays a very important role. As noted above, Prelic et al. showed that biclustering on discrete data could yield good results [15]. Consequently, when the results of biclustering on discrete data are used as starting points for an algorithm on continuous data, the latter is very likely to find highly relevant biclusters. Based on this concept, we designed a fast biclustering algorithm on discrete data and coupled it with a local search on continuous data.

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 5, NO. 4, OCTOBER-DECEMBER 2008

Y. Christinat is with the Laboratory for Computational Biology and Bioinformatics, School of Computer and Communication Sciences, Ecole Polytechnique Fédérale de Lausanne, Station 14, CH-1015 Lausanne, Switzerland. E-mail: yann.christinat@epfl.ch.

B. Wachmann and L. Zhang are with Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540. E-mail: bernd.wachmann@siemens.com, lzhang@cs.sunysb.edu.

Manuscript received 16 Mar. 2007; revised 19 July 2007; accepted 28 July 2007; published online 30 Aug. 2007. For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-2007-03-0032. Digital Object Identifier no. 10.1109/TCBB.2007.70251.

1545-5963/08/$25.00 © 2008 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.

However, another issue needs to be addressed: the choice of a bicluster model and its associated score function. As mentioned above, current biclustering algorithms aim at