Biclustering of high-throughput gene expression data with BiclusterMiner
András Király*, János Abonyi
Department of Process Engineering
University of Pannonia
Veszprem, Hungary
Asta Laiho*, Attila Gyenesei*,**
Turku Centre for Biotechnology
University of Turku and Åbo Akademi University
Turku, Finland
* Equal contribution.
** Corresponding author.
Abstract— During recent years, many biclustering algorithms
have been developed for the analysis of gene expression data to
complement and expand the capabilities of traditional
clustering methods. With biclustering, genes with similar
expression profiles can be identified not only over the whole
data set but also across subsets of experimental conditions,
allowing genes to simultaneously belong to several expression
patterns. This property makes biclustering a powerful
approach, especially when it is applied to data with a large
number of conditions. In spite of this clear theoretical benefit,
the full potential of biclustering has not been realized within
the gene expression research community, and it has never
truly become part of standard gene expression data
analysis. Possible reasons include a lack of awareness
of the various complementary ways in which biclustering can
be applied to microarray or next-generation sequencing based
gene expression data sets, and the lack of reliable and fast
algorithms. In this paper, we first illustrate the various
opportunities of applying biclustering within a typical gene
expression data analysis pipeline. Then a new biclustering
method (BiclusterMiner) is presented that can be applied to all
presented cases. The developed method is the first discrete
biclustering algorithm that is able to simultaneously handle
both up- and down-regulated genes by taking the direction of
regulation into account and still discover all possible maximal
biclusters. The efficiency of the proposed algorithm is
demonstrated on real and synthetic datasets.
Keywords: biclustering, gene expression data analysis
I. INTRODUCTION
During the last decade, DNA microarrays have become a
mature and widely used technology for measuring
genome-wide gene expression level differences between biological
samples. The recent development of high-throughput
short-read sequencing technologies has enabled even more
sensitive analysis of gene expression by making the use of
predesigned interrogation probes unnecessary.
High-throughput gene expression data is typically
analyzed to detect the differentially expressed genes between
experimental sample condition groups. Many different
techniques including various clustering based approaches
have been developed for this purpose. High-throughput
expression data is typically presented as an expression value
matrix with genes as rows and samples as columns.
Clustering can then be used for sorting the genes and/or
samples based on their general expression value similarity.
Each gene and sample is included in the clustering result
exactly once, as traditional clustering algorithms do not allow
an item to belong to multiple clusters or to be excluded from
the clustering result. However, in biological data gene
subsets are typically co-expressed only under a subset of
samples or sample condition groups. In principle,
biclustering provides a solution to this problem as it does not
set a priori constraints on the organization of the biclusters,
meaning that any gene can belong to several of the
resulting clusters or to none of them. Thus, biclustering is potentially able to
identify gene groups that have similar expression patterns
over only a subset of samples or sample condition groups.
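The idea can be illustrated with a toy example. The following minimal sketch is not part of BiclusterMiner; the gene names, the binarized values, and the helper functions is_bicluster and membership are hypothetical and serve only to show that one gene may belong to several biclusters while another belongs to none.

```python
# Toy binarized expression matrix (hypothetical data).
# Rows are genes, columns are samples; 1 marks a gene as
# differentially expressed in that sample, 0 marks it as unchanged.
matrix = {
    "g1": [1, 1, 0, 0, 0],
    "g2": [1, 1, 0, 1, 1],
    "g3": [0, 0, 0, 1, 1],
    "g4": [0, 0, 0, 0, 0],
}


def is_bicluster(genes, samples):
    """Return True if every selected gene is 1 in every selected sample."""
    return all(matrix[g][s] == 1 for g in genes for s in samples)


def membership(gene, biclusters):
    """Count how many of the given biclusters contain the gene."""
    return sum(gene in genes for genes, _ in biclusters)


# Two biclusters defined over different subsets of samples.
bicluster_a = (["g1", "g2"], [0, 1])
bicluster_b = (["g2", "g3"], [3, 4])
biclusters = [bicluster_a, bicluster_b]

for gene in matrix:
    print(gene, "belongs to", membership(gene, biclusters), "bicluster(s)")
```

Here g2 is a member of both biclusters and g4 of neither, which a traditional partitioning of the rows could not express.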
Recognizing this potential, several
biclustering algorithms have been proposed for the
identification of gene expression patterns during the last
decade. The earliest studies on partitioning a matrix into
submatrices come from Morgan and Sonquist [23] and
Hartigan [12]. Another attempt, employing hierarchical
clustering to both rows and columns in a coupled two-way
clustering manner, was proposed by Getz et al. [9]. The term
“biclustering” and the first biclustering method for gene
expression data, based on a greedy iterative search approach,
were introduced by Cheng and Church [7]. Since then,
numerous biclustering algorithms have been developed,
identifying different kinds of bicluster structures (e.g.
exclusive row and column biclusters, nonoverlapping
biclusters, overlapping biclusters, arbitrarily positioned
overlapping biclusters) and proposing various mining
approaches (e.g. iterative row and column clustering, divide
and conquer approach, greedy iterative search, exhaustive
bicluster enumeration, and distribution parameter
identification). For comprehensive reviews, see [5,22,31].
Despite the clear theoretical benefits, biclustering has
never gained wide popularity within the gene
expression analysis community. This may be partially
explained by a lack of awareness of the various opportunities
for applying biclustering to gene expression data. Therefore,
we describe here the different options available in the context
of a typical gene expression data analysis pipeline. As the
data has to be discretized before discrete biclustering
methods can be applied, the necessary steps for producing
the discretized data matrix are also described for each of the cases.
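As a rough illustration of what a discretization step can look like, the sketch below maps log2 expression ratios to +1 (up-regulated), -1 (down-regulated), or 0 (unchanged). This is an assumption made for illustration only: the function names, the log2-ratio input, and the threshold of 1.0 are hypothetical and are not taken from the procedures described in this paper.

```python
def discretize(log_ratios, threshold=1.0):
    """Map each log2 ratio to +1, -1, or 0.

    The threshold is a hypothetical choice: values at or above it are
    called up-regulated, values at or below its negative are called
    down-regulated, and everything in between is called unchanged.
    """
    return [
        [1 if v >= threshold else -1 if v <= -threshold else 0 for v in row]
        for row in log_ratios
    ]


def binarize(discrete):
    """Collapse the signed matrix to 0/1, discarding the direction."""
    return [[1 if v != 0 else 0 for v in row] for row in discrete]


# Hypothetical log2 ratios: rows are genes, columns are samples.
log_ratios = [
    [1.5, -2.0, 0.3],
    [0.1, 1.0, -1.2],
]

signed = discretize(log_ratios)
print(signed)
print(binarize(signed))
```

Keeping the signed matrix preserves the direction of regulation, whereas the binarized matrix only records whether a gene changed at all.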
Although there are some biclustering methods that work
on real-valued data, the most popular methods have been
developed for discretized and, in practice, binarized data.
This is mainly because the methods working on real values
2012 IEEE 12th International Conference on Data Mining Workshops
978-0-7695-4925-5/12 $26.00 © 2012 IEEE
DOI 10.1109/ICDMW.2012.42
131