Analysis of Singleton ORFans in Fully Sequenced
Microbial Genomes
Naomi Siew¹,² and Daniel Fischer²*
¹ Department of Chemistry, Ben Gurion University, Beer-Sheva, Israel
² Bioinformatics Group, Department of Computer Science, Ben Gurion University, Beer-Sheva, Israel
ABSTRACT Singleton sequence ORFans are orphan ORFs (open reading
frames) that have no detectable sequence similarity to any other
sequence in the databases. ORFans are of particular interest not only
as evolutionary puzzles but also because we can learn little about
them using bioinformatics tools. Here, we present a first systematic
analysis of singleton ORFans in the first 60 fully sequenced microbial
genomes. We show that although ORFans have been underemphasized, the
number of ORFans is steadily growing, currently accounting for 23,634
sequences. At the same time, the percentage of ORFans as a fraction of
all sequences is slowly diminishing, and is currently about 14%. Short
ORFans comprise about 61% of all ORFans. The abundance of short ORFans
may be due to a yet unexplained artifact. The data also suggest that
the number of longer ORFans may soon diminish as more genomes of
closely related organisms become available. To better address the
questions about the functions and origins of ORFans, we propose to
focus further studies on the longer ORFans, with emphasis on three new
types of ORFans: ORFan modules, paralogous ORFans, and orthologous
ORFans. We conclude that the large number of ORFans reflects an
intrinsic property of the genetic material not yet fully understood.
Further computational and experimental studies aimed at understanding
Nature’s protein diversity should also include ORFans.
Proteins 2003;53:241–251.
© 2003 Wiley-Liss, Inc.
Key words: ORFans; complete genomes; evolution;
singletons; microbial diversity
INTRODUCTION
Since the 1995 sequencing of the first genome of a free-living
organism, that of Haemophilus influenzae,¹ the genomes of several
dozen organisms have been sequenced, and dozens more are under way.
This wealth of continuously growing sequence data contains a large
number of protein sequences awaiting interpretation that, once
deciphered, will add to a whole new understanding of Nature.
The availability of complete genome sequences of modern organisms has
clearly revealed that the genetic material is mainly the result of the
basic evolutionary process of descent with modification. Most of the
open reading frames (ORFs) in a newly sequenced organism encode
proteins belonging to homologous families that are more or less
conserved in a number of organisms. Some of these families contain
ORFs from most of the known genomes and usually correspond to widely
conserved functions essential for life. Other families contain ORFs
from organisms belonging to one kingdom only, thus corresponding to
functions specific to that kingdom. In addition to these relatively
conserved families, the currently fully sequenced genomes also contain
a variety of families with decreasing levels of conservation. At the
lower end, we observe a non-negligible number of families that contain
ORFs of only a few (generally closely related) organisms, or of a
single organism only. Surprisingly, a large number of genome sequences
belong to single-member families. We refer to such sequences as orphan
ORFs or ORFans for short.²⁻⁴ ORFans account for 25–30% of the ORFs of
each newly sequenced genome,⁵,⁶ and their percentage can even be as
high as 60%,⁷ suggesting that sequence diversity in Nature may be
greater than previously expected. Because little can be learned about
ORFans via homology, only experimental characterization can help
elucidate their functions and origin.⁸⁻¹³ Thus, each ORFan represents
a mystery awaiting interpretation.¹³,¹⁴
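The single-member-family definition above can be illustrated with a minimal Python sketch. It assumes that family assignments are already available as a mapping from ORF identifier to family label; the identifiers, family labels, and function names below are hypothetical and are not taken from the study, whose own procedure is not described in this section.

```python
from collections import Counter


def find_orfans(family_of):
    """Return, in sorted order, the ORFs that are the sole member of their family.

    family_of: dict mapping an ORF identifier to a family label
    (a hypothetical input format, assumed for illustration only).
    """
    family_sizes = Counter(family_of.values())
    return sorted(orf for orf, fam in family_of.items() if family_sizes[fam] == 1)


def orfan_fraction(family_of):
    """Return the fraction of all ORFs that are singleton ORFans."""
    if not family_of:
        return 0.0
    return len(find_orfans(family_of)) / len(family_of)


# Toy data, invented for illustration: three organisms, four families.
toy = {
    "orgA_001": "fam1",  # family conserved across three organisms
    "orgB_017": "fam1",
    "orgC_204": "fam1",
    "orgA_002": "fam2",  # two members from one organism: not a singleton
    "orgA_003": "fam2",
    "orgB_050": "fam3",  # single-member family
    "orgC_099": "fam4",  # single-member family
}

print(find_orfans(toy))     # ['orgB_050', 'orgC_099']
print(orfan_fraction(toy))  # 2 of the 7 toy ORFs are singletons
```

In this toy setting, a family with two members from the same organism is not counted as a singleton, which is consistent with the single-member criterion stated above.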
ORFans may correspond to highly divergent sequences that actually
belong to known families (but are beyond recognition capabilities of
current tools),² or to sequences that correspond to new, unique,
single-member families.²,¹⁵ Because there is no obvious evolutionary
mechanism to account for the origin of single-member families, one
might accept the explanation of their origin as extreme divergence.
However, even if all ORFans correspond to highly divergent members of
known families, a number of puzzling questions arise. For example, how
have their sequences diverged to such an extent that no similar
sequences are detected today?¹⁶ If evolution works through descent
with modification, then why is it that no similar sequences are found
in other organisms? Why is it that we
Grant sponsor: United States–Israel Binational Science Foundation
(BSF), Jerusalem, Israel; Grant number 1998422.
Grant sponsor: N.S. is supported in part by grants from the Ministry
of Science, Israel, and from the Kreitman Foundation Fellowship.
*Correspondence to: Daniel Fischer, Bioinformatics Group, Department
of Computer Science, Ben Gurion University, Beer-Sheva 84105,
Israel. E-mail: dfischer@cs.bgu.ac.il
Received 31 October 2002; Accepted 3 January 2003
PROTEINS: Structure, Function, and Genetics 53:241–251 (2003)