As the number of completely sequenced genomes rapidly increases, the postgenomic problem of gene function identification becomes ever more pressing. Predicting the structures of proteins encoded by genes of interest is one possible means to glean subtle clues as to the functions of these proteins. There are limitations to this approach to gene identification and a survey of the expected reliability of different protein structure prediction techniques has been undertaken.

Addresses
Department of Biological Sciences, Brunel University, Uxbridge, Middlesex UB8 3PH, UK; e-mail: David.Jones@brunel.ac.uk

Current Opinion in Structural Biology 2000, 10:371–379

0959-440X/00/$ - see front matter
© 2000 Elsevier Science Ltd. All rights reserved.

Abbreviations
CAFASP  Critical Assessment of Fully Automated Structure Prediction
CASP    Critical Assessment in Structure Prediction
HMM     hidden Markov model
ORF     open reading frame
PDB     Protein Data Bank
RMSD    root mean square deviation

Introduction
It is expected that a first draft of the complete human genome sequence will be available sometime in the year 2000. Although the completion of the sequence will probably take several more years, this milestone alone represents a major breakthrough in molecular biology. Sequencing efforts for simpler organisms are also continuing to produce increasing volumes of valuable data and, at the time of writing, some 30 or so complete bacterial genome sequences are available in the sequence databanks.

As we are now clearly moving into the postsequencing phase of many genome projects, attention is becoming more and more focused on the correct identification of gene function. Assigning a function to a gene is an important first step in characterising its role in the various cellular processes and, without this information, the value of genome sequencing is greatly reduced.
Simple sequence comparison techniques are by far the most widely used method for making an initial identification of a particular gene product. By identifying homology between a new gene and a gene of known function, some inferences can be made as to the function of the new gene. How reliably the function can be extrapolated to the new gene depends on a number of factors, but the principal factor is, of course, the degree of sequence similarity observed. In recent years, sequence comparison methods, such as PSI-BLAST [1], or methods based on hidden Markov models (HMMs) [2] have ‘pushed the envelope’ as far as detecting homologous relationships goes. As more and more remote relationships are considered, however, it becomes less clear how reliably one can map the function of one gene to another [3]. Nonetheless, sensitive sequence comparison techniques are still the most important technology that we have for rapidly characterising new gene products.

Despite the power of modern sequence comparison techniques, there still remain open reading frames (ORFs) that either match no other entry in existing sequence databanks or match proteins that are also of unknown function. These sequence orphans or ‘ORFans’ [4] are a source of great debate. Certainly, at present, this class of gene represents a large fraction of larger completely sequenced genomes. However, estimates of the number of orphans vary greatly [4,5] and, consequently, different opinions exist as to how much of a problem these genes represent. The recent paper by Fischer and Eisenberg [4] suggests that the fraction of genomes falling into the category of sequence orphans is unlikely to change rapidly. These conclusions are based on the observation that the number of uncharacterisable ORFs in the yeast genome, for example, has not been reduced substantially in the two years or so since it was first completely sequenced.
This is despite the fact that the sizes of the sequence databanks have more than doubled over this period. Underlying this kind of estimate is, of course, the fact that, even for yeast, it is still not possible to be certain about the true number of expressed genes. The fraction of the approximately 6000 ORFs found in the yeast genome that really relate to expressed proteins is still unknown. Indeed, at the extreme end, one recent calculation [5] on the yeast genome suggests that the true fraction of yeast ORFs that are sequence orphans may be as little as 5%.

Whatever the true number of sequence orphans, the fact remains that there is a ‘hard core’ of small sequence families across a wide variety of genomes for which no functional information apparently exists.

Direct experimental function determination is perhaps the ideal approach to characterising these orphan proteins. Gene knockout experiments and expression array techniques are just two of many experimental techniques that are now being widely applied to function determination.

New theoretical methods for predicting gene function have also been proposed [6••,7••]. Two basic ideas are represented by these methods. Firstly, proteins of similar function may ‘co-evolve’ [6••]. In other words, groups of proteins that are found in some organisms, but not others, may share some common function. Proteins that might be found, for example, in aerobic organisms, but never in

Protein structure prediction in the postgenomic era
David T Jones