Biclustering of high-throughput gene expression data with BiclusterMiner

András Király *, János Abonyi
Department of Process Engineering, University of Pannonia, Veszprem, Hungary

Asta Laiho *, Attila Gyenesei *,**
Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, Turku, Finland

* Equal contribution. ** Corresponding author.

Abstract: During recent years, many biclustering algorithms have been developed for the analysis of gene expression data to complement and expand the capabilities of traditional clustering methods. With biclustering, genes with similar expression profiles can be identified not only over the whole data set but also across subsets of experimental conditions, allowing genes to belong to several expression patterns simultaneously. This property makes biclustering a powerful approach, especially when it is applied to data with a large number of conditions. In spite of this clear theoretical benefit, the full potential of biclustering has not been realized within the gene expression research community, and it has never truly become a part of standard gene expression data analysis. Possible reasons include limited awareness of the various complementary ways in which biclustering can be applied to microarray or next-generation sequencing based gene expression data sets, and the lack of reliable and fast algorithms. In this paper, we first illustrate the various opportunities for applying biclustering within a typical gene expression data analysis pipeline. We then present a new biclustering method (BiclusterMiner) that can be applied to all of the presented cases. The developed method is the first discrete biclustering algorithm that is able to handle both up- and down-regulated genes simultaneously by taking the direction of regulation into account while still discovering all possible maximal biclusters. The efficiency of the proposed algorithm is demonstrated on real and synthetic datasets.
Keywords: biclustering, gene expression data analysis

I. INTRODUCTION

During the last decade, DNA microarrays have become a mature and widely used technology for measuring genome-wide gene expression level differences between biological samples. The recent development of high-throughput short read sequencing technologies has enabled even more sensitive analysis of gene expression by making the use of predesigned interrogation probes unnecessary. High-throughput gene expression data is typically analyzed to detect the genes that are differentially expressed between experimental sample condition groups. Many different techniques, including various clustering based approaches, have been developed for this purpose.

High-throughput expression data is typically presented as an expression value matrix with genes as rows and samples as columns. Clustering can then be used for sorting the genes and/or samples based on their general expression value similarity. Each gene and sample is included in the clustering result exactly once, as traditional clustering algorithms do not allow an item to belong to multiple clusters or to be excluded from the clustering result. However, in biological data, gene subsets are typically co-expressed only under a subset of samples or sample condition groups. In principle, biclustering provides a solution to this problem, as it does not set a priori constraints on the organization of the biclusters, meaning that any gene can belong to several of the resulting clusters or to none of them. Thus biclustering is potentially able to identify gene groups that have similar expression patterns over only a subset of samples or sample condition groups.

In recognition of this potential, several biclustering algorithms have been proposed for the identification of gene expression patterns during the last decade. The first studies to partition a matrix into submatrices come from Morgan and Sonquist [23] and Hartigan [12].
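
As a concrete illustration of the matrix view described above, the following minimal Python sketch builds a toy discretized expression matrix and two hand-picked overlapping biclusters. The gene and sample names, the matrix values, and the helper functions `is_bicluster` and `membership_counts` are all hypothetical and chosen only for illustration. This is not the BiclusterMiner algorithm: it only shows that, unlike in a traditional clustering, a gene may belong to several biclusters or to none.

```python
# Toy discretized expression matrix: rows are genes, columns are samples.
# 1 marks a gene as differentially expressed in that sample, 0 otherwise.
# All names and values are invented for illustration.
genes = ["g1", "g2", "g3", "g4"]
samples = ["s1", "s2", "s3", "s4"]
matrix = [
    [1, 1, 0, 0],  # g1
    [1, 1, 1, 1],  # g2
    [0, 0, 1, 1],  # g3
    [0, 0, 0, 0],  # g4
]


def is_bicluster(matrix, rows, cols):
    """Return True if every cell of the rows x cols submatrix equals 1."""
    return all(matrix[r][c] == 1 for r in rows for c in cols)


def membership_counts(n_rows, biclusters):
    """Count, for each row (gene), how many biclusters contain it."""
    counts = [0] * n_rows
    for rows, _cols in biclusters:
        for r in rows:
            counts[r] += 1
    return counts


# Two hand-picked biclusters, each given as (row indices, column indices).
biclusters = [
    ({0, 1}, {0, 1}),  # g1 and g2 over s1 and s2
    ({1, 2}, {2, 3}),  # g2 and g3 over s3 and s4
]

counts = membership_counts(len(genes), biclusters)
for gene, count in zip(genes, counts):
    print(gene, "belongs to", count, "bicluster(s)")
```

In this toy example g2 belongs to both biclusters and g4 to neither, which a partition of the genes into disjoint clusters could not express.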
Another attempt, employing hierarchical clustering of both rows and columns in a coupled two-way clustering manner, was proposed by Getz et al. [9]. The term “biclustering” and the first biclustering method for gene expression data, which uses a greedy iterative search approach, were introduced by Cheng and Church [7]. Since then, numerous biclustering algorithms have been developed that identify different kinds of bicluster structures (e.g. exclusive row and column biclusters, nonoverlapping biclusters, overlapping biclusters, arbitrarily positioned overlapping biclusters) and propose various mining approaches (e.g. iterative row and column clustering, divide and conquer, greedy iterative search, exhaustive bicluster enumeration, and distribution parameter identification). For comprehensive reviews, see [5,22,31].

Despite the clear theoretical benefits, biclustering has never gained wide popularity within the gene expression analysis community. This may be partially explained by limited awareness of the various opportunities for applying biclustering to gene expression data. Therefore, we describe here the different options available in the context of a typical gene expression data analysis pipeline. As the data has to be discretized in order to apply biclustering, the necessary steps for producing the discretized data matrix are also described in each of the cases. Although there are some biclustering methods that work on real-valued data, the most popular methods have been developed for discretized and, in practice, binarized data. This is mainly because the methods working on real values
2012 IEEE 12th International Conference on Data Mining Workshops 978-0-7695-4925-5/12 $26.00 © 2012 IEEE DOI 10.1109/ICDMW.2012.42 131