Assigning three-dimensional protein folds to genome
sequences is essential to understanding protein function.
Although experimental three-dimensional structures are
currently available for only a very small fraction of these
sequences, computational fold assignment is able to assign
folds to 20–30% of the sequences in various genomes. This
percentage varies depending on the particular organism
under analysis, on the sensitivities of the methods used and
on the number of experimental structures available at the
time the assignment is carried out. The fraction of
assignable sequences is currently increasing at an annual
rate of roughly 18%. If this rate is sustained throughout the
coming years, three-dimensional computational models for
more than half of the genome sequences may be available
by the year 2003.
Addresses
*Faculty of Natural Science, Department of Math and Computer
Science, Beer-Sheva 84015, Israel; e-mail: dfischer@cs.bgu.ac.il
†UCLA-DOE Laboratory of Structural Biology and Molecular Medicine,
Molecular Biology Institute, University of California Los Angeles, Box
951570, Los Angeles, CA 90095-1570, USA;
e-mail: david@mbi.ucla.edu
Current Opinion in Structural Biology 1999, 9:208–211
http://biomednet.com/elecref/0959440X00900208
© Elsevier Science Ltd ISSN 0959-440X
Abbreviations
3D three-dimensional
ORF open reading frame
ORFan orphan ORF
WWW World Wide Web
Introduction
The determination of complete genome sequences has
opened possibilities for extending our understanding of
life at the molecular level. Currently, the complete
genomes of about two dozen organisms have been deter-
mined and dozens more are expected to be completed by
the turn of the century (see http://www.tigr.org for a list),
including that of the human. Knowing the three-dimen-
sional structures of the proteins encoded in these
genomes is essential for understanding their molecular
functions; however, three-dimensional (3D) structures
have been determined experimentally for only a small
fraction of these proteins. Thus, in the absence of exper-
imental structures, computational methods aimed at
assigning 3D models are likely to aid in the characteriza-
tion of genome proteins. This review describes recent
work on the computational assignment of folds to com-
plete genomes using homology modeling and fold
assignment or threading techniques; we avoid mention-
ing other computational structure-prediction methods
that, to the best of our knowledge, have not yet been
applied to complete genomes.
Three-dimensional fold assignments
3D assignments to genome proteins can be divided into
two classes. The first corresponds to clear assignments in
which the sequence similarity between the genome pro-
tein and the assigned fold is above a predefined threshold.
The second class corresponds to assignments in which the
sequence similarity between the genome protein and the
assigned fold is below the given threshold and, thus,
sequence similarity alone cannot establish their validity.
Assignments by sequence similarity
Several groups have analyzed complete genomes in order
to identify those sequences with clear sequence similarity
to proteins of known structure [1,2••,3••,4] (Table 1). To
this end, standard sequence comparison techniques, such
as BLAST [5] or FASTA [6], can be applied. These tech-
niques are very efficient and reliable in detecting the vast
majority of homologs above the so-called twilight zone of
sequence similarity. The assignment is carried out as fol-
lows: given an ORF (open reading frame) from a newly
sequenced organism, a database containing the sequences
of known 3D structures is searched in order to detect
sequence similarities above a predefined threshold. If a
sufficiently high-scoring match is found, a 3D model for
the new ORF can thus be assigned. The sequence-to-
sequence alignment largely corresponds to the structural
alignment that could be obtained had a structure existed
for the genome sequence. Thus, from such assignments,
relatively accurate 3D models can be built using homology
modeling techniques [3••]. The fraction of the ORFs from
complete genomes that can be assigned a structure
depends on the following four factors: the particular organ-
ism under consideration; the date on which the study was
carried out (new structures are being determined at a very
fast rate, see Conclusions); the sensitivity of the method
used; and the threshold used to consider an assignment
valid, which, in turn, determines the rate of potential false
positives. The reported fractions of fold assignments in the
various genomes vary from 10–15% for the earliest studies
(e.g. [1]) to 15–20% for the most recent and/or more permissive studies [2••,3••].
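The threshold-based procedure described above (search a database of sequences of known 3D structure with each ORF and accept the best match only if it scores above a predefined cutoff) can be illustrated with a minimal sketch. This is not the method of any of the cited studies: the Smith-Waterman scoring function here is a toy stand-in for BLAST or FASTA, and the scoring parameters, the function names `local_alignment_score` and `assign_fold`, and the threshold value are all assumptions chosen for illustration.

```python
from typing import Dict, Optional, Tuple


def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Smith-Waterman local alignment score with a linear gap penalty.

    Toy stand-in for a BLAST/FASTA score; the parameters are illustrative
    assumptions, not values taken from the review.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def assign_fold(orf: str, structure_db: Dict[str, str],
                threshold: int) -> Optional[Tuple[str, int]]:
    """Return (structure_id, score) for the best-scoring database entry
    if its score reaches the threshold; otherwise return None."""
    best_id, best_score = None, -1
    for struct_id, seq in structure_db.items():
        score = local_alignment_score(orf, seq)
        if score > best_score:
            best_id, best_score = struct_id, score
    if best_id is not None and best_score >= threshold:
        return best_id, best_score
    return None


# Hypothetical database of sequences with known structures (made-up data).
structure_db = {"foldA": "MKTAYIAKQR", "foldB": "GGSWLPNVHE"}
hit = assign_fold("MKTAYIAKQR", structure_db, threshold=15)
miss = assign_fold("CCCCCCCC", structure_db, threshold=15)
```

Raising the threshold lowers the number of assignable ORFs but also the rate of potential false positives, which is the trade-off named among the four factors above. An ORF for which `assign_fold` returns `None` falls into the second class of assignments, where sequence similarity alone cannot establish validity.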
Assignments with no sequence similarity
Of those genome sequences with no sequence similarity to
proteins of known structure, some correspond to novel,
previously unobserved folds and the others correspond to
folds that have already been observed. During evolution,
structure is better conserved than sequence; consequently,
although a new genome protein may show no sequence
similarity to any protein of known structure, it may adopt a
known fold. Two proteins with a similar 3D fold, but with
no sequence similarity may be distant relatives belonging
to the same superfamily (the sequences have diverged
beyond the level of random similarity among unrelated
Predicting structures for genome proteins
Daniel Fischer* and David Eisenberg†