Quantifying the Similarities within Fold Space

Andrew Harrison¹*, Frances Pearl¹, Richard Mott², Janet Thornton¹,³ and Christine Orengo¹

¹Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
²Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK
³European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK

We have used GRATH, a graph-based structure comparison algorithm, to map the similarities between the different folds observed in the CATH domain structure database. Statistical analysis of the distributions of the fold similarities has allowed us to assess the significance of any similarity. We have therefore examined whether it is best to represent folds as discrete entities or whether, in fact, a more accurate model would be a continuum wherein folds overlap via common motifs. To do this we have introduced a new statistical measure of fold similarity, termed gregariousness. For a particular fold, gregariousness measures how many other folds have a significant structural overlap with that fold, typically comprising 40% or more of the larger structure. Gregarious folds often contain commonly occurring super-secondary structural motifs, such as β-meanders, greek keys, αβ plait motifs or α-hairpins, which match similar motifs in other folds. Apart from one example, all the most gregarious folds, matching 20% or more of the other folds in the database, are αβ proteins. They also occur in highly populated architectural regions of fold space, adopting sandwich-like arrangements containing two or more layers of α-helices and β-strands. Domains that exhibit low gregariousness are those that have very distinctive folds, with few common motifs or motifs that are packed in unusual arrangements.
Most of the superhelices exhibit low gregariousness despite containing some commonly occurring super-secondary structural motifs. In these folds, these common motifs are combined in an unusual way and represent a small proportion of the fold (<10%). Our results suggest that fold space may be considered as continuous for some architectural arrangements (e.g. αβ sandwiches), in that super-secondary motifs can be used to link neighbouring fold groups. However, in other regions of fold space much more discrete topologies are observed with little similarity between folds.

© 2002 Elsevier Science Ltd. All rights reserved

Keywords: fold space; GRATH; fold similarity; CATH; gregariousness

*Corresponding author

Introduction

Here we report on significant structural overlaps between folds and how these similarities are distributed across the set of known structures, also described as “fold space”. There is considerable interest in this distribution as highly recurrent motifs may be associated with favourable folding arrangements of secondary structures and similarities between fold groups may reveal evolutionary mechanisms for extending the protein structure repertoire.

Many previous analyses of structural similarity have concentrated mainly on identifying global structural relationships. For example, the CATH¹ classification of protein folds gives a discrete description of fold space. Currently approximately 750 folds are identified using a robust structure comparison algorithm. Empirical criteria are used for classifying proteins into these fold groups. Classifications such as SCOP² and CATH are often used to provide fold libraries for structure prediction algorithms such as threading,³ which attempt to fit sequences to 3D structures by optimising energy profiles. In this context, there has been considerable discussion as to whether it is appropriate to consider folds as discrete entities or whether a continuum of folds exists.
E-mail address of the corresponding author: harry@biochem.ucl.ac.uk
Abbreviations used: CASP, competitive assessment of structure prediction.
J. Mol. Biol. (2002) 323, 909–926
doi:10.1016/S0022-2836(02)00992-0, available online at http://www.idealibrary.com
0022-2836/02/$ - see front matter

In the latter case, information on putative structural neighbours would