BIOINFORMATICS Vol. 19 Suppl. 1 2003, pages i323–i330
DOI: 10.1093/bioinformatics/btg1045

Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis

Y. Yamanishi 1,*, J.-P. Vert 2, A. Nakaya 1 and M. Kanehisa 1

1 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan and 2 Centre de Géostatistique, Ecole des Mines de Paris, 35 rue Saint-Honoré, 77305 Fontainebleau cedex, France

Received on January 6, 2003; accepted on February 20, 2003

ABSTRACT
Motivation: A major issue in computational biology is the reconstruction of pathways from several genomic datasets, such as expression data, protein interaction data and phylogenetic profiles. As a first step toward this goal, it is important to investigate the amount of correlation which exists between these data.
Method: We present new methods to measure the correlation between several heterogeneous datasets, and to extract sets of genes which share similarities with respect to multiple biological attributes. The originality of our approach is the extension of the concept of correlation to non-vectorial data, which is made possible by the use of generalized kernel canonical correlation analysis (KCCA), and the method we propose to extract groups of genes responsible for the detected correlations. Moreover, two variants of KCCA are proposed for the case where more than two datasets are available.
Result: These methods are successfully tested on their ability to recognize operons in the Escherichia coli genome, from the comparison of three datasets corresponding to functional relationships between genes in metabolic pathways, geometrical relationships along the chromosome, and co-expression relationships as observed by gene expression data.
Contact: yoshi@kuicr.kyoto-u.ac.jp
*To whom correspondence should be addressed.

INTRODUCTION
Recent developments in high-throughput technologies have filled biological databases with many kinds of genomic data, such as pathway knowledge (Kanehisa et al., 2002), microarray gene expression data (Eisen et al., 1998), protein-protein interaction data (Ito et al., 2001), phylogenetic profiles (Pellegrini et al., 1999), and several more. The problem of reconstructing pathways from such genomic datasets is a major issue in computational biology because pathways represent a higher level of biological functions than single genes. As a first step toward this goal, it is crucial to investigate the correlation which exists between multiple biological attributes, and eventually to use this correlation in order to extract biologically meaningful features from heterogeneous genomic data. Indeed, a correlation detected between multiple datasets is likely to be due to some hidden biological phenomenon. Moreover, by selecting the genes responsible for the correlation, one can expect to select groups of genes which play a special role in or are affected by the underlying biological phenomenon.

As an example, the existence of operons in prokaryotes is responsible for a form of correlation between several datasets, because genes which form operons are close to each other along chromosomes, have similar expression profiles and can catalyze successive reactions in a pathway. Conversely, one can start from three datasets containing the localization of the genes on the genome, their expression profiles, and the chemical reactions they catalyze in known pathways, and look for correlations between these datasets, in order to finally recover groups of genes that may form operons.

The integration of different kinds of data has been investigated with a variety of approaches so far.
Using graph-theoretical arguments, clusters of genes have been extracted from several biological networks using multiple graph comparison by Ogata et al. (2000) and Nakaya et al. (2001). Several approaches using kernel methods have also been proposed, such as the combination of kernel matrices of expression data and phylogenetic profiles (Pavlidis et al., 2001) or the extraction of features from microarray data using a gene network as side information (Vert et al., 2003). In both cases, the goal was to improve the performance of gene function prediction algorithms.

Bioinformatics 19(Suppl. 1) © Oxford University Press 2003; all rights reserved.

A well-known statistical method to investigate the correlation between different real-valued attributes is canonical correlation analysis (CCA) (Hotelling, 1936). How-