Predicting Interresidue Contacts Using Templates and Pathways
Yu Shao and Christopher Bystroff*
Department of Biology, Rensselaer Polytechnic Institute, Troy, New York
ABSTRACT We present a novel method, HMMSTR-CM, for protein contact
map prediction. Contact potentials were calculated by using HMMSTR, a
hidden Markov model for local sequence-structure correlations. Targets
were aligned against protein templates using a Bayesian method, and
contact maps were generated from these alignments. The contact
potentials were then used to evaluate the templates. We also developed
an ab initio method that applies a rule-based strategy to the target
contact potentials to model the protein-folding pathway. Fold
recognition and ab initio methods were combined to produce accurate,
protein-like contact maps. Pathways sometimes led to an unambiguous
prediction of topology, even without using templates. The results on
CASP5 targets are discussed. Also included is a brief update on the
quality of fully automated ab initio predictions using the I-sites
server. Proteins 2003;53:497–502.
© 2003 Wiley-Liss, Inc.
Key words: predictions; contact maps; HMMSTR; rule-based; protein
folding; I-sites; Rosetta; hidden Markov models
INTRODUCTION
Traditional structure prediction methods represent proteins either as
three-dimensional structures or linear strings of secondary structure
symbols. Contact maps are square symmetrical Boolean matrices that
represent protein tertiary structures in a two-dimensional (2D) format.
The 2D format has simplified the process of developing a rule-based
algorithm for protein-folding pathways. The new algorithm, called
HMMSTR-CM, has been tested on CASP5 targets.
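As an illustration of the contact map representation described above, the following minimal sketch builds a square, symmetric Boolean matrix from a list of residue coordinates. The function name `contact_map`, the 8 Å distance cutoff, the use of one point per residue (for example, the Cα atom), and the minimum sequence separation are all assumptions made for this example; the text does not state the contact definition used by HMMSTR-CM.

```python
import math


def contact_map(coords, cutoff=8.0, min_separation=3):
    """Return a square, symmetric Boolean contact matrix.

    coords: list of (x, y, z) tuples, one point per residue.
    cutoff: distance threshold in angstroms (assumed, not from the paper).
    min_separation: pairs closer than this along the chain are skipped
        (assumed, not from the paper).
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                # Set both halves so the matrix stays symmetric.
                cmap[i][j] = cmap[j][i] = True
    return cmap


# Toy hairpin: two antiparallel three-residue strands, 3.8 A apart.
hairpin = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
           (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
hairpin_map = contact_map(hairpin)
```

In this toy hairpin the contacts fall on an anti-diagonal of the matrix, the pattern that antiparallel strand pairing typically produces in a contact map.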
Two-dimensional flat images are more readily discernable to the eye
and more memorable than complex, rotating three-dimensional (3D)
images. With only a little training, a student can learn to quickly
distinguish a contact map for an α/β barrel from that of a three-layer
α/β fold, two different topologies that are very similar in their
secondary structures. Similarities between distant homologues or
analogs of α/β and all-β folds can be seen easily in contact maps, even
when the 3D structures superimpose poorly. If our eyes can recognize
protein folds from 2D patterns, it makes sense that we may be able to
program a computer to do so and thereby create a new tool for learning
the rules of folding.
Contact maps may be projected into three dimensions if they satisfy
the conditions of a sphere intersection graph of a self-avoiding
chain,[1] conditions that all true protein contact maps satisfy but not
all predicted ones. Methods that reconstruct the protein structure from
its contact map have been developed.[2–5]
Previous contact map prediction methods have used neural nets,[6,7]
correlated mutations,[8–11] and association rules.[12,13]
Neural net-based predictions had an average accuracy of about 21%
overall,[14] whereas higher accuracies were reported for local
contacts,[7] but the accuracy is lower for all-α proteins.
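To make the notion of prediction accuracy concrete, here is a minimal sketch that scores a predicted contact map against a native one. It assumes that accuracy means the fraction of predicted contacts that are also present in the native map, a common convention in contact prediction that the text above does not spell out. The function name `contact_accuracy` and the `min_separation` parameter are hypothetical, introduced only for this example.

```python
def contact_accuracy(predicted, native, min_separation=1):
    """Fraction of predicted contacts that also occur in the native map.

    predicted, native: square Boolean matrices of equal size.
    min_separation: only pairs (i, j) with j - i >= min_separation are
        counted, so local and nonlocal contacts can be scored separately
        (hypothetical parameter).
    """
    n = len(native)
    n_predicted = 0
    n_correct = 0
    # Only the upper triangle is scanned because the maps are symmetric.
    for i in range(n):
        for j in range(i + min_separation, n):
            if predicted[i][j]:
                n_predicted += 1
                if native[i][j]:
                    n_correct += 1
    return n_correct / n_predicted if n_predicted else 0.0


def make_map(n, pairs):
    """Build a symmetric Boolean matrix from a list of (i, j) contacts."""
    m = [[False] * n for _ in range(n)]
    for i, j in pairs:
        m[i][j] = m[j][i] = True
    return m


native_map = make_map(4, [(0, 2), (0, 3)])
predicted_map = make_map(4, [(0, 3), (1, 3)])
overall = contact_accuracy(predicted_map, native_map)
long_range = contact_accuracy(predicted_map, native_map, min_separation=3)
```

Raising `min_separation` restricts the score to contacts between residues that are far apart in sequence, which is one way to report local and nonlocal accuracy separately.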
Our earlier work[13] led us to believe that two important factors were
missing in contact map predictions. First, typical predicted contact
maps were ambiguous or physically impossible in 3D. Second, the order
of appearance of contacts was not considered, even though much is known
about folding pathways.[15–18] In the new approach, we tried to
incorporate “physicality” and protein-like characteristics by using
protein templates and simple rules. The rules consist of common sense
facts for packing of secondary structures. Rules for the order of
appearance were derived from the general assumptions of a
nucleation/propagation pathway.[15]
MATERIALS AND METHODS
The results of two methods are discussed here: the I-sites server,
which is fully automated, and HMMSTR-CM, which was only partially
automated. The two methods consist of suites of programs having a
common origin in the I-sites Library or its hidden Markov model
incarnation HMMSTR. The I-sites server uses the folding simulation
program ROSETTA,[19] originally developed in the Baker laboratory. The
strategy used by the I-sites server has been previously documented,[20]
and no significant changes were made to it before predicting the CASP5
targets.
The initial steps in processing the query sequence, database search,
and building of a sequence profile are common to the two methods.
Single sequences were submitted to PSI-BLAST,[21] searching nr with an
E value cutoff of 0.001. The resulting multiple-sequence alignment was
converted to a sequence profile, as previously described.
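The conversion from a multiple-sequence alignment to a sequence profile can be sketched generically as follows. This is not the authors' procedure, which is only referred to as previously described: it is a plain column-frequency profile with a uniform pseudocount and no sequence weighting, and the names `sequence_profile` and `pseudocount` are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_profile(alignment, pseudocount=1.0):
    """Convert aligned sequences into per-column amino acid frequencies.

    alignment: list of equal-length strings; '-' marks a gap.
    pseudocount: uniform count added to every amino acid in every column
        (assumed scheme, not the one used in the paper).
    Returns one dict per alignment column, mapping amino acid to frequency.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in alignment:
            residue = seq[col].upper()
            # Gaps and non-standard symbols are skipped.
            if residue in counts:
                counts[residue] += 1.0
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()})
    return profile


toy_alignment = ["ACD", "ACE", "A-D"]
toy_profile = sequence_profile(toy_alignment)
```

Each column of the resulting profile sums to 1, so a fully conserved position such as the first column of the toy alignment gives its residue the highest frequency while still leaving nonzero probability for the others.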
The target sequence profile was used to generate 3D coordinates (TS)
using the I-sites server or contact maps
*Correspondence to: Christopher Bystroff, Department of Biology,
Rensselaer Polytechnic Institute, 110 8th St., Troy, NY 12180. E-mail:
bystrc@rpi.edu
Received 13 February 2003; Accepted 19 May 2003
PROTEINS: Structure, Function, and Genetics 53:497–502 (2003)