J. Mol. Biol. (1990) 212, 403-428

Definition of General Topological Equivalence in Protein Structures

A Procedure Involving Comparison of Properties and Relationships through Simulated Annealing and Dynamic Programming

Andrej Šali† and Tom L. Blundell‡

Laboratory of Molecular Biology
Department of Crystallography
Birkbeck College, University of London
London WC1E 7HX, England

(Received 26 June 1989; accepted 31 October 1989)

A protein is defined as an indexed string of elements at each level in the hierarchy of protein structure: sequence, secondary structure, super-secondary structure, etc. The elements, for example, residues or secondary structure segments such as helices or β-strands, are associated with a series of properties and can be involved in a number of relationships with other elements. Element-by-element dissimilarity matrices are then computed and used in the alignment procedure based on the sequence alignment algorithm of Needleman & Wunsch, expanded by the simulated annealing technique to take into account relationships as well as properties. The utility of this method for exploring the variability of various aspects of protein structure and for comparing distantly related proteins is demonstrated by multiple alignment of serine proteinases, aspartic proteinase lobes and globins.

1. Introduction

Protein evolution has led to families of structures that have the same "fold" at the level of super-secondary structures, motifs, domains or entire globular proteins (Richardson, 1977, 1981). The thermodynamically stable arrangements available to a polypeptide must satisfy a number of criteria, such as hydrophobicity and close packing of the protein interior and hydrogen bonding of peptide amide and carbonyl functions that become inaccessible to the aqueous environment. This is achieved mainly through α-helices and β-sheets, which can be close-packed in relatively few ways (Chothia, 1984).
Thus, in evolution, tertiary structure is more conserved than the amino acid sequence and the number of stable folds is limited at each level in the hierarchical structure of proteins.

The consequences of this for protein modelling have long been recognized. Closely related homologous proteins of known three-dimensional structures can be used to model sequences of proteins of unknown tertiary structures (Browne et al., 1969; Greer, 1981; for a review, see Blundell et al., 1987). The basis of this modelling is the superposition of a number of three-dimensional structures of homologous proteins in order to define topological equivalence of amino acid residues (McLachlan, 1979, 1982; Schulz, 1980; KenKnight, 1984). This has been developed into a systematic approach in which several homologous structures can be used in modelling the unknown (Sutcliffe et al., 1987a,b; Blundell et al., 1988; Šali et al., 1990). Thus, rules and procedures have been established to define the relative positions of the conserved secondary structural elements (i.e. the framework: Sutcliffe et al., 1987a), to select appropriate fragments for the variable regions, which are often loops (Sibanda & Thornton, 1985; Chothia et al., 1986; Sutcliffe, 1988; Sibanda et al., 1989), and for the replacement of side-chains (Sutcliffe et al., 1987b; Summers et al., 1987; McGregor et al., 1987; M. S. Johnson, unpublished results). In a parallel development, Jones & Thirup (1986) have shown that modelling using electron density during protein crystallography can also be aided by selection of fragments from a series of other proteins of known three-dimensional structure.

† On leave from the Department of Biochemistry, J. Stefan Institute, Ljubljana, Yugoslavia.
‡ Author to whom all correspondence should be sent.

0022-2836/90/060403-26 $03.00/0 © 1990 Academic Press Limited

However, although proteins within families have the same tertiary fold, the secondary structural