Bioinformatics

Computational comparisons of model genomes

Christos Ouzounis, Georg Casari, Chris Sander, Javier Tamames and Alfonso Valencia

Complete genomes from model organisms provide new challenges for computational molecular biology. Novel questions emerge from the genome data obtained from the functional prediction of thousands of gene products. In this review, we present some approaches to the computational comparison of genomes, based on sequence and text analysis, and comparisons of genome composition and gene order.

With the recent publication of the complete genome sequences from two bacteria, Haemophilus influenzae Rd¹ and Mycoplasma genitalium², new challenges are emerging for computational biology. Two such challenges are (1) to predict and annotate the functions of the gene products as rapidly and completely as possible, and (2) to derive adequate abstractions that make genomes comparable at a higher-than-molecular level.

Function prediction is a primary goal of genome-sequence analysis, as many newly determined sequences have no experimental information associated with them, while functional information can be derived by examining homology to proteins of known function. Prediction can be carried out by integrating and co-ordinating a number of well-tested methods that rapidly and efficiently identify sequences with the highest degree of similarity from complete databases, and can, therefore, assist function prediction using homology observations. The GeneQuiz system³ automatically annotates protein-encoding sequences and can identify novel functions, e.g. for H. influenzae⁴ and M. genitalium⁵ sequences*. The use of integrated databases and software tools, combined with the application of a number of empirical rules that can automatically eliminate false annotations⁶, makes this possible.

If the functional annotations are known, what is the next step in genome analysis?
We have been exploring new ways to make use of this information, and have been performing global comparisons of genomic data, addressing a number of questions. These analyses yield a profile of the composition of genomic functions of an organism, identifying some components that are common to other species, and some that appear to be unique. These predicted ‘expression patterns’ can help in the identification of novel or potential metabolic or regulatory pathways, and provide a faster route to the development of targets for drug design and discovery.

C. Ouzounis (ou.zouni@ai.sri.com) is at the Artificial Intelligence Center, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA. A. Valencia and J. Tamames are at the Protein Design Group, Centro Nacional de Biotecnologia, CSIC, Campus U. Autonoma, E-28049 Madrid, Spain. G. Casari is at the European Molecular Biology Laboratory, Meyerhofstraße 1, D-69012 Heidelberg, Germany. C. Sander is at the European Bioinformatics Institute, EMBL, Hinxton Hall, Cambridge, UK CB10 1RQ.

Orthologues: functionally equivalent genes across species

A gene with a certain level of sequence similarity to its homologue in the genome of another species may have the same function as its homologue; such genes are defined as orthologues⁷. How is it possible to determine whether the two proteins encoded by the genes of different species have the same function? Proteins change during evolution, forming families of related molecules that have similar primary, secondary and tertiary structures, but which have divergent functions. Algorithms for sequence comparison can detect genes encoding homologous proteins, but are unable to determine definitively whether two molecules have exactly the same function.
Therefore, function prediction by detection of sequence similarity, especially in incomplete genomes, can only be approximate, because it usually relies on the genes encoding the most similar proteins, and these genes might not yet have been identified. In complete genomes, however, there is a finite number of genes, so it is possible to determine which genes share the highest degree of similarity, thus narrowing the set of experiments that must be performed to prove that proteins encoded by orthologous genes have the same function. The average similarity value can then be calculated, helping in the estimation of the rate of change in different families. The

*Analysis of the H. influenzae, M. genitalium and S. cerevisiae genomes, including functional classification, is available on the World Wide Web at <http://www.sander.embl-heidelberg.de/ge.

TIBTECH AUGUST 1996 (VOL 14)
Copyright © 1996, Elsevier Science Ltd. All rights reserved. 0167-7799/96/$15.00. PII: S0167-7799(96)10043-3
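
As an illustration of the orthologue discussion above, the following minimal Python sketch shows one way to pick, between two complete genomes, the gene pairs that share the highest degree of similarity, and then to average that similarity. The text does not name a specific procedure; the reciprocal-best-hit heuristic used here is a common stand-in chosen for illustration, and it is not presented as the authors' method. The function names, gene identifiers and percent-identity values are all hypothetical toy data.

```python
from statistics import mean


def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps (query, subject) -> similarity, e.g. percent identity.
    On ties, the pair seen first is kept (an arbitrary choice for this sketch).
    """
    best = {}
    for (query, subject), value in scores.items():
        if query not in best or value > best[query][1]:
            best[query] = (subject, value)
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b, similarity) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    pairs = []
    for gene_a, (gene_b, value) in best_ab.items():
        # Keep the pair only if gene_b's best hit in genome A is gene_a.
        if best_ba.get(gene_b, (None, 0.0))[0] == gene_a:
            pairs.append((gene_a, gene_b, value))
    return sorted(pairs)


# Hypothetical percent-identity values between genes of genome A and genome B.
a_vs_b = {("a1", "b1"): 72.0, ("a1", "b2"): 35.0,
          ("a2", "b2"): 48.0, ("a3", "b2"): 41.0}
b_vs_a = {("b1", "a1"): 72.0, ("b2", "a2"): 48.0,
          ("b2", "a3"): 41.0, ("b2", "a1"): 35.0}

# Candidate orthologue pairs and their average similarity.
candidates = reciprocal_best_hits(a_vs_b, b_vs_a)
average_similarity = mean(value for _, _, value in candidates)
```

Because every gene of a complete genome is present in the comparison, the best hit found for each gene is the best that exists, which is what makes the reciprocal check meaningful. The resulting pairs remain candidates only: as the text notes, sequence similarity alone cannot prove that two proteins have the same function.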