Pervasiveness of Gene Conservation and Persistence of Duplicates in
Cellular Genomes
Fredj Tekaia, Bernard Dujon
Unité de Génétique Moléculaire des Levures (CNRS URA1300 and Université Pierre et Marie Curie UFR927), Institut Pasteur, 25 rue du Dr
Roux, F-75724 Paris Cedex 15, France
Received: 4 May 1998 / Accepted: 28 September 1998
Abstract. In this work detailed statistics on ancestral
gene duplication and gene conservation in completely
sequenced cellular genomes are presented. Analysis of
open reading frame (ORF) products having simultaneous
matches in several distinct organisms showed a signifi-
cant correlation between duplication and conservation.
Systematic comparisons of predicted proteomes of 23
organisms (including 20 that have been completely
sequenced) have allowed us to quantify the degree of
ancestral duplication within each genome and the level of
conservation between genomes, using threshold values
calculated for individual organisms. Statistical analysis
of various gene proportions revealed interesting trends in
gene structure and evolution, such as that (a) more than
one-quarter (25%–66%) of the predicted ORF products
of the surveyed organisms are not unique, indicating a
high level of ancestral duplications; (b) levels of exclu-
sive conservation within Bacteria are higher than those
within the eukaryal or archaeal domains; and (c) at least
one-half (47%–99%) of the total predicted ORF products
in the surveyed genomes have one or several highly sig-
nificant matches in another genome. Significant matches
are based on simulations taking into account the mean
size of ORF products and the composition of each target
organism’s proteome. The methodology we have devel-
oped ensures stability and comparability of our results as
the number of completely sequenced genomes increases.
Key words: Ancestral duplication — Ancestral Con-
servation — Organism-specific open reading frames —
Sequence extinction
Introduction
The availability of complete genomic sequences permit-
ted systematic comparisons both within and between or-
ganisms. While the first type of comparison indicates the
degree of ancestral duplications whose products have
remained in the genome after various degrees of diver-
gence, the second shows the degree of conservation of
genes throughout evolution. Sets of gene sequences from
human, Caenorhabditis elegans, yeast, and Escherichia
coli were compared, and ancient evolutionarily conserved
regions (ACRs) were detected (Green et al. 1990). Com-
parative genomics is a rapidly growing field of investi-
gation, and identification of duplicated and conserved
gene products is commonly based on their sequence
similarity. Given the heterogeneity, complexity, and phy-
logenetic distances of the surveyed organisms, and the
need for their comparison, we systematically compared
all existing proteomes using the same method and pa-
rameters. Both comparisons were performed using the
first 20 genomes whose complete sequences have been
published.
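As an illustration only, the two comparisons can be sketched as a simple tally over a table of pairwise matches. The function name `summarize`, the tuple layout of `hits`, the toy ORF identifiers, and the threshold numbers below are all assumptions made for this sketch; they are not the thresholds or the software used in this work, where the thresholds come from simulations. The sketch counts an ORF product as duplicated when it has a significant match to a different ORF product of the same genome, and as conserved when it has a significant match in another genome, with significance judged against a threshold specific to the target organism.

```python
from collections import defaultdict


def summarize(orfs, hits, thresholds):
    """Tally duplicated and conserved ORF products per genome.

    orfs:       dict mapping genome name -> set of ORF identifiers
    hits:       iterable of (query_genome, query_orf,
                             target_genome, target_orf, score)
    thresholds: dict mapping target genome -> minimum significant score
                (hypothetical values; assumed to be given)
    """
    duplicated = defaultdict(set)
    conserved = defaultdict(set)
    for q_genome, q_orf, t_genome, t_orf, score in hits:
        # Significance is judged against the target organism's threshold.
        if score < thresholds[t_genome]:
            continue
        if q_genome == t_genome:
            # Ignore the trivial match of a sequence with itself.
            if q_orf != t_orf:
                duplicated[q_genome].add(q_orf)
        else:
            conserved[q_genome].add(q_orf)
    return {
        genome: {
            "duplicated": len(duplicated[genome]) / len(ids),
            "conserved": len(conserved[genome]) / len(ids),
        }
        for genome, ids in orfs.items()
        if ids
    }


# Toy data, invented for the example.
orfs = {"A": {"a1", "a2", "a3", "a4"}, "B": {"b1", "b2"}}
thresholds = {"A": 100, "B": 80}
hits = [
    ("A", "a1", "A", "a1", 500),  # self match, ignored
    ("A", "a1", "A", "a2", 150),  # within-genome match
    ("A", "a2", "A", "a1", 150),  # within-genome match
    ("A", "a3", "B", "b1", 90),   # significant against B's threshold
    ("A", "a4", "B", "b2", 60),   # below B's threshold
    ("B", "b1", "A", "a3", 90),   # below A's threshold
]
result = summarize(orfs, hits, thresholds)
print(result)
```

Because each threshold belongs to the target organism, the same pair of sequences can be significant in one direction and not in the other, as the last two cross-genome rows of the toy data show.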
Comparisons were facilitated by restricting our analy-
ses to amino acid sequences of predicted proteins. The
statistical analysis of ancestral duplication and evolution-
ary conservation is presented and may form the basis for
Correspondence to: Fredj Tekaia; e-mail: tekaia@pasteur.fr
J Mol Evol (1999) 49:591–600
© Springer-Verlag New York Inc. 1999