Assigning three-dimensional protein folds to genome sequences is essential to understanding protein function. Although experimental three-dimensional structures are currently available for only a very small fraction of these sequences, computational methods can assign folds to 20–30% of the sequences in various genomes. This percentage varies depending on the particular organism under analysis, on the sensitivities of the methods used and on the number of experimental structures available at the time the assignment is carried out. The fraction of assignable sequences is currently increasing at an annual rate of roughly 18%. If this rate is sustained throughout the coming years, three-dimensional computational models for more than half of the genome sequences may be available by the year 2003.

Addresses
*Faculty of Natural Science, Department of Math and Computer Science, Beer-Sheva 84015, Israel; e-mail: dfischer@cs.bgu.ac.il
UCLA-DOE Laboratory of Structural Biology and Molecular Medicine, Molecular Biology Institute, University of California Los Angeles, Box 951570, Los Angeles, CA 90095-1570, USA; e-mail: david@mbi.ucla.edu

Current Opinion in Structural Biology 1999, 9:208–211
http://biomednet.com/elecref/0959440X00900208
© Elsevier Science Ltd ISSN 0959-440X

Abbreviations
3D three-dimensional
ORF open reading frame
ORFan orphan ORF
WWW World Wide Web

Introduction
The determination of complete genome sequences has opened possibilities for extending our understanding of life at the molecular level. Currently, the complete genomes of about two dozen organisms have been determined, and dozens more, including the human genome, are expected to be completed by the turn of the century (see http://www.tigr.org for a list).
Knowing the three-dimensional structures of the proteins encoded in these genomes is essential for understanding their molecular functions; however, three-dimensional (3D) structures have been determined experimentally for only a small fraction of these proteins. Thus, in the absence of experimental structures, computational methods aimed at assigning 3D models are likely to aid in the characterization of genome proteins. This review describes recent work on the computational assignment of folds to complete genomes using homology modeling and fold assignment or threading techniques; we avoid mentioning other computational structure-prediction methods that, to the best of our knowledge, have not yet been applied to complete genomes.

Three-dimensional fold assignments
3D assignments to genome proteins can be divided into two classes. The first corresponds to clear assignments in which the sequence similarity between the genome protein and the assigned fold is above a predefined threshold. The second class corresponds to assignments in which the sequence similarity between the genome protein and the assigned fold is below the given threshold and, thus, sequence similarity alone cannot establish their validity.

Assignments by sequence similarity
Several groups have analyzed complete genomes in order to identify those sequences with clear sequence similarity to proteins of known structure [1,2••,3••,4] (Table 1). To this end, standard sequence comparison techniques, such as BLAST [5] or FASTA [6], can be applied. These techniques are very efficient and reliable in detecting the vast majority of homologs above the so-called twilight zone of sequence similarity. The assignment is carried out as follows: given an ORF (open reading frame) from a newly sequenced organism, a database containing the sequences of known 3D structures is searched in order to detect sequence similarities above a predefined threshold.
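The search step just described can be illustrated with a minimal Python sketch. It assumes that similarity scores between one ORF and each known structure have already been computed (for example by a tool such as BLAST or FASTA) and only shows the thresholding decision. The function name, the fold identifiers, the score values and the thresholds are hypothetical and are not taken from any of the studies cited here.

```python
from typing import Dict, Optional, Tuple


def best_match_above_threshold(
    hits: Dict[str, float], threshold: float
) -> Optional[Tuple[str, float]]:
    """Return (structure_id, score) for the top hit if it reaches the threshold.

    `hits` maps an identifier of a known 3D structure to a precomputed
    similarity score for one ORF; a higher score is assumed to mean
    greater similarity. Returns None when no hit reaches the threshold.
    """
    if not hits:
        return None
    # Pick the single highest-scoring known structure.
    structure_id, score = max(hits.items(), key=lambda item: item[1])
    return (structure_id, score) if score >= threshold else None


# Hypothetical scores for one ORF against three known structures.
example_hits = {"fold_A": 42.0, "fold_B": 118.5, "fold_C": 77.0}

print(best_match_above_threshold(example_hits, threshold=100.0))  # ('fold_B', 118.5)
print(best_match_above_threshold(example_hits, threshold=150.0))  # None
```

The sketch assumes that larger scores indicate stronger similarity; for measures where smaller is better, such as E-values, the comparison would be reversed.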
If a sufficiently high-scoring match is found, a 3D model for the new ORF can thus be assigned. The sequence-to-sequence alignment largely corresponds to the structural alignment that could be obtained had a structure existed for the genome sequence. Thus, from such assignments, relatively accurate 3D models can be built using homology modeling techniques [3••]. The fraction of the ORFs from complete genomes that can be assigned a structure depends on the following four factors: the particular organism under consideration; the date on which the study was carried out (new structures are being determined at a very fast rate; see Conclusions); the sensitivity of the method used; and the threshold used to consider an assignment valid, which, in turn, determines the rate of potential false positives. The reported fractions of fold assignments in the various genomes vary from 10–15% for the earliest studies (e.g. [1]) to 15–20% for the most recent and/or more permissive studies [2••,3••].

Assignments with no sequence similarity
Of those genome sequences with no sequence similarity to proteins of known structure, some correspond to novel, previously unobserved folds and the others correspond to folds that have already been observed. During evolution, structure is better conserved than sequence; consequently, although a new genome protein may show no sequence similarity to any protein of known structure, it may adopt a known fold. Two proteins with a similar 3D fold but no sequence similarity may be distant relatives belonging to the same superfamily (the sequences have diverged beyond the level of random similarity among unrelated

Predicting structures for genome proteins Daniel Fischer* and David Eisenberg