Computers & Operations Research 37 (2010) 1361–1368
journal homepage: www.elsevier.com/locate/cor

An application of kernel methods to gene cluster temporal meta-analysis

Marco Antoniotti a,*, Marco Carreras a, Antonella Farinaccio a, Giancarlo Mauri a, Daniele Merico b,c, Italo Zoppis a

a Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano Bicocca, U14, Viale Sarca 336, I-20126 Milano, Italy
b Dipartimento di Scienze Biomolecolari e Biotecnologie (DSBB), Università degli Studi di Milano, Via Celoria 26, I-20133 Milano, Italy
c Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular & Biomolecular Research, University of Toronto, 160 College Street, Toronto, ON, Canada M5S-3E1

ARTICLE INFO

Available online 26 March 2009

Keywords: Clustering; Gene ontology; Kernel methods

ABSTRACT

The application of various clustering techniques to large-scale gene-expression measurement experiments is a well-established method in bioinformatics. Clustering is usually accompanied by a functional characterization of the gene sets, obtained by assessing statistical enrichments of structured vocabularies such as the gene ontology (GO) [Gene Ontology Consortium. The gene ontology (GO) project in 2006. Nucleic Acids Research (Database issue), vol. 34; 2006. p. D322–6]. If different clusters are generated for correlated experiments, a machine learning step termed cluster meta-analysis may be performed in order to discover relations among the components of such sets. Several approaches have been proposed; in particular, kernel methods may be used to exploit the graph structure of typical ontologies such as GO. Following up on the formulation of such an approach [Merico D, Zoppis I, Antoniotti M, Mauri G. Evaluating graph kernel methods for relation discovery in GO-annotated clusters.
In: KES-2007/WIRN-2007, Part IV, Lecture notes in artificial intelligence, vol. 4694. Berlin: Springer; 2007. p. 892–900; Zoppis I, Merico D, Antoniotti M, Mishra B, Mauri G. Discovering relations among GO-annotated clusters by graph kernel methods. In: Proceedings of the 2007 international symposium on bioinformatics research and applications. Lecture notes in computer science, vol. 4463. Berlin: Springer; 2007], in this paper we discuss, from an information-theoretic point of view, further results about its applicability and its performance.

© 2009 Elsevier Ltd. All rights reserved.

This work has been supported in part by a Università degli Studi Milano Bicocca FIAR Grant and by the EC “Marie Curie” Grant MIRG-CT-2005-031140.
* Corresponding author. Tel./fax: +39 02 64 48 79 01. E-mail address: antoniotti.marco@disco.unimib.it (M. Antoniotti).
0305-0548/$ - see front matter. doi:10.1016/j.cor.2009.03.011

1. Introduction

Modern biology has been revolutionized by the adoption of high-throughput techniques: the newly acquired capability to generate massive amounts of data requires increasing contributions from the computer science community and poses many novel and interesting problems [20]. In this section we first describe the problem we address, in the context of gene expression studies, and then outline the strategy for its solution using a kernel function.

Paper organization. To help in navigating the material, we provide the following “map of the paper”.

First, in Section 1.1, we describe what we mean by cluster meta-analysis in the context of what are known in biology and bioinformatics as enrichment studies, where sets (or “clusters”) of genes or other biologically relevant items are tagged with labels usually coming from a controlled vocabulary or ontology. Next, we describe how we are tackling one particular aspect of cluster meta-analysis: we introduce the use of a kernel approach to perform the enrichment step, in order to exploit the hierarchical structure of ontologies, and of the gene ontology (GO) in particular.

Second, in Section 2, we describe which data sets we considered in our study and how we pre-processed them to produce an initial set of clusters. This initial set of clusters was then submitted to our kernel algorithm for “enrichment”. The set of clusters is potentially organized in a temporal sequence, with a subset of clusters taken at time t_i, another subset taken at time t_{i+1}, and so on. We are interested in finding relations among clusters belonging to two adjacent “time points”. The approach can then be extended to the whole sequence; its mathematical formulation and its assessment take up the rest of the paper.

In Section 3, we describe the mathematical underpinning of our application of kernel methods and, in particular, of graph kernels to the cross-cluster enrichment problem. Also in Section 3, in Section 3.2, we discuss the quality scores we used to assess the goodness of our method with respect to other well-known measures (i.e., the Jaccard coefficient). In particular, we use an information-theoretic approach to justify and measure the reduction of uncertainty that we obtain by tagging (selecting) a certain relationship between two (gene)