As the number of completely sequenced genomes rapidly
increases, the postgenomic problem of gene function
identification becomes ever more pressing. Predicting the
structures of proteins encoded by genes of interest is one
possible means to glean subtle clues as to the functions of
these proteins. This approach to identifying gene function has
limitations, however, and the expected reliability of different
protein structure prediction techniques is surveyed here.
Addresses
Department of Biological Sciences, Brunel University, Uxbridge,
Middlesex UB8 3PH, UK; e-mail: David.Jones@brunel.ac.uk
Current Opinion in Structural Biology 2000, 10:371–379
0959-440X/00/$ — see front matter
© 2000 Elsevier Science Ltd. All rights reserved.
Abbreviations
CAFASP Critical Assessment of Fully Automated Structure Prediction
CASP Critical Assessment in Structure Prediction
HMM hidden Markov model
ORF open reading frame
PDB Protein Data Bank
RMSD root mean square deviation
Introduction
It is expected that a first draft of the complete human
genome sequence will be available sometime in the year
2000. Although the completion of the sequence will probably
take several more years, this milestone alone represents
a major breakthrough in molecular biology. Sequencing
efforts for simpler organisms are also continuing to produce
increasing volumes of valuable data and, at the time of writing,
some 30 or so complete bacterial genome sequences
are available in the sequence databanks.
As we are now clearly moving into the postsequencing
phase of many genome projects, attention is becoming
more and more focused on the correct identification of
gene function. Assigning a function to a gene is an
important first step in characterising its role in the various
cellular processes and, without this information, the
value of genome sequencing is greatly reduced. Of
course, simple sequence comparison techniques are by
far the most widely used method for making an initial
identification of a particular gene product. By identifying
homology between a new gene and a gene of known
function, some inferences can be made as to the function
of the new gene. How reliably the function can be
extrapolated to the new gene depends on a number of
factors, but the principal factor is, of course, the degree
of sequence similarity observed.
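As a rough illustration of what "degree of sequence similarity" can mean in practice, the sketch below computes percent identity between two sequences that have already been aligned. It is only a minimal example under stated assumptions: the function name `percent_identity`, the gap character `-` and the choice of denominator (columns in which neither sequence has a gap) are illustrative and not taken from the article, and other conventions for the denominator are in common use. The example sequences are invented.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the ungapped columns of a pairwise alignment.

    Assumption: both strings are rows of the same alignment, so they
    have equal length and use `gap` to mark insertions/deletions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0   # columns where both sequences have a residue
    identical = 0  # compared columns where the residues match

    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if res_a.upper() == res_b.upper():
            identical += 1

    if compared == 0:
        return 0.0  # no comparable columns; report zero rather than divide by zero
    return 100.0 * identical / compared


# Invented toy sequences: 6 of 8 aligned residues match.
print(percent_identity("MKTAYIAK", "MKSAYLAK"))  # 75.0
```

A single number of this kind says nothing on its own about how safely a function can be transferred; it only quantifies the observed similarity on which such a judgement would be based.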
In recent years, sequence comparison methods, such as
PSI-BLAST [1], or methods based on hidden Markov
models (HMMs) [2] have ‘pushed the envelope’ as far as
detecting homologous relationships goes. Of course, as
more and more remote relationships are being considered,
it becomes less clear how reliably one can map the
function of one gene to another [3]. Nonetheless, sensitive
sequence comparison techniques are still the most
important technology that we have for rapidly characterising
new gene products.
Despite the power of modern sequence comparison techniques,
there still remain open reading frames (ORFs) that
either match no other entry in existing sequence databanks
or match proteins that are also of unknown function.
These sequence orphans or ‘ORFans’ [4•] are a source of
great debate. Certainly, at present, this class of gene represents
a large fraction of larger completely sequenced
genomes. However, estimates of the number of orphans
vary greatly [4•,5] and, consequently, different opinions
exist as to how much of a problem these genes represent.
The recent paper by Fischer and Eisenberg [4•] suggests
that the fraction of genomes falling into the category of
sequence orphans is unlikely to change rapidly.
These conclusions are based on the observation that the
number of uncharacterisable ORFs in the yeast genome,
for example, has not been reduced substantially in the two
years or so since it was first completely sequenced. This is
despite the fact that the sizes of the sequence databanks
have more than doubled over this period. Underlying this
kind of estimate is, of course, the fact that, even for yeast,
it is still not possible to be certain about the true number
of expressed genes. The fraction of the approximately
6000 ORFs found in the yeast genome that really relate to
expressed proteins is still unknown. Indeed, at the
extreme end, one recent calculation [5] on the yeast
genome suggests that the true fraction of yeast ORFs that
are sequence orphans may be as little as 5%.
Whatever the true number of sequence orphans, there
remains a ‘hard core’ of small sequence families across
a wide variety of genomes for which no functional
information apparently exists.
Direct experimental function determination is perhaps the
ideal approach to characterising these orphan proteins.
Gene knockout experiments and expression array techniques
are just two of many experimental techniques that
are now being widely applied to function determination.
New theoretical methods for predicting gene function
have also been proposed [6••,7••]. Two basic ideas are
represented by these methods. Firstly, proteins of similar
function may ‘co-evolve’ [6••]. In other words, groups of
proteins that are found in some organisms, but not others,
may share some common function. Proteins that might be
found, for example, in aerobic organisms, but never in
Protein structure prediction in the postgenomic era
David T Jones