Pervasiveness of Gene Conservation and Persistence of Duplicates in Cellular Genomes

Fredj Tekaia, Bernard Dujon

Unité de Génétique Moléculaire des Levures (CNRS URA1300 and Université Pierre et Marie Curie UFR927), Institut Pasteur, 25 rue du Dr Roux, F-75724 Paris Cedex 15, France

Received: 4 May 1998 / Accepted: 28 September 1998

Abstract. In this work detailed statistics on ancestral gene duplication and gene conservation in completely sequenced cellular genomes are presented. Analysis of open reading frame (ORF) products having simultaneous matches in several distinct organisms showed a significant correlation between duplication and conservation. Systematic comparisons of predicted proteomes of 23 organisms (including 20 that have been completely sequenced), have allowed us to quantify the degree of ancestral duplication within each genome and the level of conservation between genomes, using threshold values calculated for individual organisms. Statistical analysis of various gene proportions revealed interesting trends in gene structure and evolution, such as that (a) more than one-quarter (25%–66%) of the predicted ORF products of the surveyed organisms are not unique, indicating a high level of ancestral duplications; (b) levels of exclusive conservation within Bacteria are higher than those within the eukaryal or archaeal domains; and (c) at least one-half (47–99%) of the total predicted ORF products in the surveyed genomes have one or several highly significant matches in another genome. Significant matches are based on simulations taking into account the mean size of ORF products and the composition of each target organism’s proteome. The methodology we have developed ensures stability and comparability of our results as the number of completely sequenced genomes increases.
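As an illustration of the two proportions summarized in the abstract, the following minimal sketch counts, for each genome, the fraction of ORF products with a significant match to a different ORF in the same genome (duplication) and the fraction with a significant match in another genome (conservation). The hit record layout, the function name, the use of a single score cutoff per target genome, and the toy data are all assumptions made for illustration; they are not the authors' actual procedure or thresholds.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical hit record (an assumption, not the paper's format):
# (query ORF, query genome, subject ORF, subject genome, similarity score)
Hit = Tuple[str, str, str, str, float]


def duplication_and_conservation(
    orfs: Dict[str, Set[str]],
    hits: Iterable[Hit],
    thresholds: Dict[str, float],
) -> Dict[str, Tuple[float, float]]:
    """Return {genome: (duplicated fraction, conserved fraction)}.

    A hit counts as significant when its score reaches the cutoff
    assigned to the target (subject) genome. Self-hits are ignored.
    """
    duplicated = {g: set() for g in orfs}
    conserved = {g: set() for g in orfs}
    for query, q_genome, subject, s_genome, score in hits:
        if score < thresholds[s_genome]:
            continue  # below the cutoff for this target genome
        if q_genome == s_genome:
            if query != subject:
                duplicated[q_genome].add(query)  # match within the genome
        else:
            conserved[q_genome].add(query)  # match in another genome
    return {
        g: (len(duplicated[g]) / len(ids), len(conserved[g]) / len(ids))
        for g, ids in orfs.items()
        if ids
    }


# Toy data, invented purely to exercise the function.
orfs = {"A": {"a1", "a2", "a3", "a4"}, "B": {"b1", "b2"}}
thresholds = {"A": 80.0, "B": 60.0}
hits = [
    ("a1", "A", "a1", "A", 500.0),  # self-hit, ignored
    ("a1", "A", "a2", "A", 120.0),
    ("a2", "A", "a1", "A", 120.0),
    ("a3", "A", "b1", "B", 90.0),
    ("a4", "A", "b2", "B", 30.0),   # below the cutoff for genome B
    ("b1", "B", "a3", "A", 90.0),
]
result = duplication_and_conservation(orfs, hits, thresholds)
```

Keying the cutoff on the target genome mirrors the statement that significance depends on the composition of each target organism's proteome; how such cutoffs would be derived from simulations is outside this sketch.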
Key words: Ancestral duplication; Ancestral conservation; Organism-specific open reading frames; Sequence extinction

Correspondence to: Fredj Tekaia; e-mail: tekaia@pasteur.fr
J Mol Evol (1999) 49:591–600 © Springer-Verlag New York Inc. 1999

Introduction

The availability of complete genomic sequences has permitted systematic comparisons both within and between organisms. The first type of comparison indicates the degree of ancestral duplication whose products have remained in the genome after various degrees of divergence; the second shows the degree of conservation of genes throughout evolution. Sets of gene sequences from human, Caenorhabditis elegans, yeast, and Escherichia coli were compared, and ancient evolutionary conserved regions (ACRs) were detected (Green et al. 1990). Comparative genomics is a rapidly growing field of investigation, and identification of duplicated and conserved gene products is commonly based on their sequence similarity. Given the heterogeneity, complexity, and phylogenetic distances of the surveyed organisms, and the need for their comparison, we systematically compared all existing proteomes using the same method and parameters. Both comparisons were performed using the first 20 genomes whose complete sequences have been published.

Comparisons were facilitated by restricting our analyses to amino acid sequences of predicted proteins. The statistical analysis of ancestral duplication and evolutionary conservation is presented and may form the basis for