Predicting Interresidue Contacts Using Templates and Pathways

Yu Shao and Christopher Bystroff*
Department of Biology, Rensselaer Polytechnic Institute, Troy, New York

ABSTRACT
We present a novel method, HMMSTR-CM, for protein contact map prediction. Contact potentials were calculated by using HMMSTR, a hidden Markov model for local sequence-structure correlations. Targets were aligned against protein templates using a Bayesian method, and contact maps were generated from these alignments. The contact potentials were then used to evaluate the templates. We also developed an ab initio method that applies a rule-based strategy to the target contact potentials to model the protein-folding pathway. Fold recognition and ab initio methods were combined to produce accurate, protein-like contact maps. Pathways sometimes led to an unambiguous prediction of topology, even without using templates. The results on CASP5 targets are discussed. Also included is a brief update on the quality of fully automated ab initio predictions using the I-sites server. Proteins 2003;53:497–502. © 2003 Wiley-Liss, Inc.

Key words: predictions; contact maps; HMMSTR; rule-based; protein folding; I-sites; Rosetta; hidden Markov models

INTRODUCTION
Traditional structure prediction methods represent proteins either as three-dimensional structures or as linear strings of secondary structure symbols. Contact maps are square, symmetrical Boolean matrices that represent protein tertiary structures in a two-dimensional (2D) format. The 2D format has simplified the process of developing a rule-based algorithm for protein-folding pathways. The new algorithm, called HMMSTR-CM, has been tested on CASP5 targets.

Two-dimensional flat images are more readily discernible to the eye and more memorable than complex, rotating three-dimensional (3D) images.
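The contact-map representation introduced above can be made concrete with a small example. The following is a minimal sketch, not the HMMSTR-CM code: the function name `contact_map`, the use of a single representative point per residue, and the 8 Å cutoff are all assumptions made for illustration, because this section does not state the contact definition that was used.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def contact_map(coords, cutoff=8.0):
    """Return an N x N Boolean contact matrix for N residue positions.

    coords: list of (x, y, z) tuples, one representative point per residue
            (an assumption of this sketch, e.g. an alpha-carbon position).
    cutoff: distance threshold in angstroms (assumed, illustrative value).
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            in_contact = dist(coords[i], coords[j]) <= cutoff
            cmap[i][j] = in_contact
            cmap[j][i] = in_contact  # fill both halves so the map is symmetric
    return cmap


def is_symmetric(cmap):
    """Check the square-symmetric property of a contact map."""
    n = len(cmap)
    return all(len(row) == n for row in cmap) and all(
        cmap[i][j] == cmap[j][i] for i in range(n) for j in range(n)
    )


# Toy input: four points on a straight line, 3.8 angstroms apart (hypothetical data).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
toy_map = contact_map(toy_coords)

# Render the map as text: '#' marks a contact, '.' marks no contact.
for row in toy_map:
    print("".join("#" if cell else "." for cell in row))
```

Storing only the upper triangle would be enough in practice; the full matrix is kept here so that the square, symmetrical form is visible when printed.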
With only a little training, a student can learn to quickly distinguish a contact map for an α/β barrel from that of a three-layer α/β fold, two topologies that are very similar in their secondary structures. Similarities between distant homologues or analogs of α/β and all folds can be seen easily in contact maps, even when the 3D structures superimpose poorly. It makes sense that if our eyes can recognize protein folds from 2D patterns, we may be able to program a computer to do so and thereby create a new tool for learning the rules of folding.

Contact maps may be projected into three dimensions if they satisfy the conditions of a sphere intersection graph of a self-avoiding chain [1]. All true protein contact maps satisfy these conditions, but not all predicted maps do. Methods that reconstruct the protein structure from its contact map have been developed [2–5].

Previous contact map prediction methods have used neural nets [6,7], correlated mutations [8–11], and association rules [12,13]. Neural net-based predictions had an average accuracy of about 21% overall [14]; higher accuracies were reported for local contacts [7], but the accuracy is lower for all-proteins.

Our earlier work [13] led us to believe that two important factors were missing in contact map predictions. First, typical predicted contact maps were ambiguous or physically impossible in 3D. Second, the order of appearance of contacts was not considered, even though much is known about folding pathways [15–18]. In the new approach, we tried to incorporate "physicality" and protein-like characteristics by using protein templates and simple rules. The rules consist of common-sense facts about the packing of secondary structures. Rules for the order of appearance were derived from the general assumptions of a nucleation/propagation pathway [15].

MATERIALS AND METHODS
The results of two methods are discussed here: the I-sites server, which is fully automated, and HMMSTR-CM, which was only partially automated.
The two methods consist of suites of programs having a common origin in the I-sites Library or its hidden Markov model incarnation, HMMSTR. The I-sites server uses the folding simulation program ROSETTA [19], originally developed in the Baker laboratory. The strategy used by the I-sites server has been previously documented [20], and no significant changes were made to it before predicting the CASP5 targets.

The initial steps in processing the query sequence (database search and building of a sequence profile) are common to the two methods. Single sequences were submitted to PSI-BLAST [21], searching the nr database with an E-value cutoff of 0.001. The resulting multiple-sequence alignment was converted to a sequence profile, as previously described. The target sequence profile was used to generate 3D coordinates (TS) using the I-sites server or contact maps

*Correspondence to: Christopher Bystroff, Department of Biology, Rensselaer Polytechnic Institute, 110 8th St., Troy, NY 12180. E-mail: bystrc@rpi.edu
Received 13 February 2003; Accepted 19 May 2003
PROTEINS: Structure, Function, and Genetics 53:497–502 (2003) © 2003 WILEY-LISS, INC.