Computers & Operations Research 37 (2010) 1361–1368
An application of kernel methods to gene cluster temporal meta-analysis
Marco Antoniotti (a, ∗), Marco Carreras (a), Antonella Farinaccio (a), Giancarlo Mauri (a), Daniele Merico (b, c), Italo Zoppis (a)
(a) Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano Bicocca, U14, Viale Sarca 336, I-20126 Milano, Italy
(b) Dipartimento di Scienze Biomolecolari e Biotecnologie (DSBB), Università degli Studi di Milano, Via Celoria 26, I-20133 Milano, Italy
(c) Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular & Biomolecular Research, University of Toronto, 160 College Street, Toronto, ON, Canada M5S-3E1
ARTICLE INFO
Available online 26 March 2009
Keywords: Clustering; Gene ontology; Kernel methods
ABSTRACT
The application of clustering techniques to large-scale gene-expression measurement experiments is a well-established method in bioinformatics. Clustering is usually accompanied by a functional characterization of the resulting gene sets, obtained by assessing the statistical enrichment of terms from structured vocabularies such as the gene ontology (GO) [Gene Ontology Consortium. The gene ontology (GO) project in 2006. Nucleic Acids Research (Database issue), vol. 34; 2006. p. D322–6]. If different clusters are generated for correlated experiments, a machine learning step termed cluster meta-analysis may be performed in order to discover relations among the components of such sets. Several approaches have been proposed; in particular, kernel methods may be used to exploit the graph structure of typical ontologies such as GO. Following up on the formulation of this approach [Merico D, Zoppis I, Antoniotti M, Mauri G. Evaluating graph kernel methods for relation discovery in GO-annotated clusters. In: KES-2007/WIRN-2007, Part IV, Lecture notes in artificial intelligence, vol. 4694. Berlin: Springer; 2007. p. 892–900; Zoppis I, Merico D, Antoniotti M, Mishra B, Mauri G. Discovering relations among GO-annotated clusters by graph kernel methods. In: Proceedings of the 2007 international symposium on bioinformatics research and applications. Lecture notes in computer science, vol. 4463. Berlin: Springer; 2007], in this paper we discuss, from an information-theoretic point of view, further results about its applicability and performance.
© 2009 Elsevier Ltd. All rights reserved.
1. Introduction
Modern biology has been revolutionized by the adoption of high-throughput techniques: the newly acquired capability to generate massive amounts of data requires increasing contributions from the computer science community and poses many novel and interesting problems [20]. In this section we first describe the problem we address, in the context of gene expression studies, and then outline the strategy for its solution using a kernel function.
Paper organization. To help the reader navigate the material, we provide the following “map of the paper”.
First, in Section 1.1, we describe what we mean by cluster meta-analysis in the context of what are known in biology and bioinformatics as enrichment studies, where sets (or “clusters”) of genes or other biologically relevant items are tagged with labels usually coming from a controlled vocabulary or ontology. Next, we describe how we tackle one particular aspect of cluster meta-analysis: we introduce a kernel approach to perform the enrichment step, in order to exploit the hierarchical structure of ontologies, and of the gene ontology (GO) in particular.
This work has been supported in part by a Università degli Studi Milano Bicocca FIAR Grant and by the EC “Marie Curie” Grant MIRG-CT-2005-031140.
∗ Corresponding author. Tel./fax: +39 02 64 48 79 01.
E-mail address: antoniotti.marco@disco.unimib.it (M. Antoniotti).
0305-0548/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.cor.2009.03.011
Second, in Section 2, we describe the data sets we considered in our study and how we pre-processed them to produce an initial set of clusters. This initial set of clusters was then submitted to our kernel algorithm for “enrichment”. The set of clusters is potentially organized in a temporal sequence, with a subset of clusters taken at time $t_i$, another subset taken at time $t_{i+1}$, and so on. We are interested in finding relations among clusters belonging to two adjacent “time points”. The approach can then be extended to the whole sequence; its mathematical formulation and assessment take up the rest of the paper.
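As a rough illustration of the adjacent-time-point setting described above, the following Python sketch scores every pair of clusters drawn from two consecutive time points. It is only a baseline under stated assumptions, not the kernel method developed in this paper: each cluster is assumed to be represented by the set of ontology labels attached to it, and the score is the plain Jaccard coefficient, which the paper uses only as a comparison measure. The function names, the cluster names, and the labels in `example` are hypothetical placeholders.

```python
from itertools import product


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard coefficient |a & b| / |a | b|; taken to be 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def adjacent_relations(
    clusters_by_time: list[dict[str, set[str]]], threshold: float = 0.0
) -> list[tuple[int, str, str, float]]:
    """Score all cluster pairs taken from adjacent time points t_i and t_{i+1}.

    clusters_by_time[i] maps a cluster name to its set of labels at time t_i.
    Returns (i, cluster at t_i, cluster at t_{i+1}, score) for scores above threshold.
    """
    relations = []
    for i in range(len(clusters_by_time) - 1):
        current, following = clusters_by_time[i], clusters_by_time[i + 1]
        for (name_a, labels_a), (name_b, labels_b) in product(
            current.items(), following.items()
        ):
            score = jaccard(labels_a, labels_b)
            if score > threshold:
                relations.append((i, name_a, name_b, score))
    return relations


# Hypothetical toy data: two time points, two clusters each, placeholder labels.
example = [
    {"c1": {"label_1", "label_2", "label_3"}, "c2": {"label_4"}},
    {"d1": {"label_2", "label_3", "label_5"}, "d2": {"label_6"}},
]
relations = adjacent_relations(example)
```

A set-overlap score of this kind treats labels as unrelated symbols and ignores the hierarchy linking them, which is the limitation that motivates a graph-kernel formulation.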
In Section 3, we describe the mathematical underpinning of our application of kernel methods and, in particular, graph kernels to the cross-cluster enrichment problem. Still within Section 3, in Section 3.2, we discuss the quality scores we used to assess our method against other well-known measures (i.e., the Jaccard coefficient). In particular, we use an information-theoretic approach to justify and measure the reduction of uncertainty that we obtain by tagging (selecting) a certain relationship between two (gene)