How Well Can We Predict Native Contacts in Proteins Based
on Decoy Structures and Their Energies?
Jiang Zhu, Qianqian Zhu, Yunyu Shi,* and Haiyan Liu*
Key Laboratory of Structural Biology, University of Science and Technology of China, Chinese Academy of Sciences, School of
Life Sciences, Hefei, Anhui, 230026, China
ABSTRACT One strategy for ab initio protein structure
prediction is to generate a large number of possible structures
(decoys) and select the best-fitting ones based on a scoring or
free energy function. The conformational space of a protein is
huge, and it is rare for any heuristically generated structure
to fall directly in the neighborhood of the native structure. It
is therefore desirable that, instead of being discarded, the
poorly fitting decoy structures provide insights into the native
structure so that predictions can be made progressively. First,
we demonstrate that a recently parameterized physics-based
effective free energy function based on the GROMOS96 force field
and a generalized Born/surface area solvent model is, like
several other physics-based and knowledge-based models, capable
of distinguishing native structures from decoy structures for a
number of widely used decoy databases. Second, we observe a
substantial increase in the correlation of the effective free
energies with the degree of similarity between the decoys and
the native structure if the similarity is measured by the
content of native inter-residue contacts in a decoy structure
rather than by its root-mean-square deviation from the native
structure. Finally, we investigate the possibility of predicting
native contacts based on the frequency of occurrence of contacts
in decoy structures. For most proteins contained in the decoy
databases, a meaningful number of native contacts can be
predicted based on plain frequencies of occurrence at a
relatively high level of accuracy. Relative to using plain
frequencies, marked improvements in the sensitivity of the
predictions are observed for the 4_state_reduced decoy sets when
energy-dependent weighting of decoy structures is applied in
determining the frequency. There, approximately 80% of native
contacts can be predicted at an accuracy of approximately 80%
using energy-weighted frequencies. The sensitivity of the plain
frequency approach is much lower (20% to 40%). Such improvements
are, however, not observed for the other decoy databases. The
rationalization and implications of the results are discussed.
Proteins 2003;52:598–608. © 2003 Wiley-Liss, Inc.
Key words: decoy discrimination; generalized Born
model; solvent-accessible surface area;
protein structure prediction
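As an illustration of the contact-frequency idea summarized in the abstract, the following minimal Python sketch computes plain and energy-weighted contact frequencies over a toy decoy set. It is not the authors' implementation. The Boltzmann-like weight exp(-(E - E_min)/T), the `temperature` parameter, the 0.5 frequency threshold, the toy data, and the reading of "accuracy" as the fraction of predicted contacts that are native and "sensitivity" as the fraction of native contacts that are predicted are all assumptions made for this example; the function names are hypothetical.

```python
import math
from collections import defaultdict


def contact_frequencies(decoy_contacts, energies=None, temperature=1.0):
    """Return {contact: frequency} over a set of decoys.

    decoy_contacts: list of sets of (i, j) residue pairs, one set per decoy.
    energies: optional list of effective free energies, one per decoy.
              If None, every decoy gets equal weight (plain frequency).
    temperature: assumed scale factor for the illustrative weight
                 exp(-(E - E_min) / temperature); not taken from the paper.
    """
    if energies is None:
        weights = [1.0] * len(decoy_contacts)
    else:
        e_min = min(energies)  # shift by the minimum for numerical stability
        weights = [math.exp(-(e - e_min) / temperature) for e in energies]
    total = sum(weights)
    freq = defaultdict(float)
    for contacts, w in zip(decoy_contacts, weights):
        for pair in contacts:
            freq[pair] += w / total
    return dict(freq)


def predict_contacts(freq, threshold=0.5):
    """Predict as native every contact whose frequency reaches the threshold."""
    return {pair for pair, f in freq.items() if f >= threshold}


def accuracy_and_sensitivity(predicted, native):
    """Assumed definitions: accuracy = TP / predicted, sensitivity = TP / native."""
    true_pos = len(predicted & native)
    accuracy = true_pos / len(predicted) if predicted else 0.0
    sensitivity = true_pos / len(native) if native else 0.0
    return accuracy, sensitivity


# Toy data (invented for illustration only).
native = {(1, 5), (2, 6), (3, 7)}
decoys = [
    {(1, 5), (2, 6), (3, 7)},  # native-like decoy with the lowest energy
    {(1, 5), (4, 8)},
    {(4, 8), (2, 9)},
    {(4, 8)},
]
energies = [-10.0, -2.0, -1.0, 0.0]

plain_pred = predict_contacts(contact_frequencies(decoys))
weighted_pred = predict_contacts(contact_frequencies(decoys, energies))

plain_scores = accuracy_and_sensitivity(plain_pred, native)
weighted_scores = accuracy_and_sensitivity(weighted_pred, native)
```

In this toy set the non-native contact (4, 8) occurs in most decoys, so plain frequencies select it, whereas the energy weighting concentrates nearly all of the weight on the lowest-energy decoy. Whether weighting helps on real decoy sets depends on how well the energy function correlates with native-contact content, which is the question the paper examines.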
INTRODUCTION
Understanding the relationship between the sequence of a protein
and its unique three-dimensional structure is a subject that has
intrigued scientists for decades.¹ In theory, genome projects
will result in known sequences for almost all proteins. The need
to discover their structures and functions highlights the
importance of protein structure prediction. In one possible
scenario, a large number of candidate structures are generated
for a peptide sequence and evaluated with an energy function
that can distinguish the native structure from the misfolded
ones (decoys). Toward the aim of developing such energy
functions, two different types of approaches are currently under
investigation. The first class, the so-called “knowledge-based
potentials,” is derived from databases of known protein
structures and usually represents interactions within proteins
at a low level of resolution. Many efficient potentials of this
category have been widely used in comparative modeling and fold
recognition, and have been extensively reviewed.²⁻⁵ The second
class is the so-called “physics-based potentials.”⁶⁻¹⁰ Until
recently, the physics-based all-atom molecular mechanics energy
functions have not been as commonly used to distinguish protein
folds as the statistical models because their application
requires structure optimization at the atomic level, which is
relatively expensive in terms of computational cost. Compared
with the statistical potentials, however, the physics-based
models have some significant advantages. Specifically, they may
possess more general applicability than the statistical models,
which can be biased by the databases from which they are
derived. This is especially relevant in discriminating between
native folds and decoys, because many statistical models have
been developed using only knowledge of the native structures or
folds, and one should not expect such models to produce
reasonable energies/scores for non-native-like decoy structures.
Grant sponsor: Chinese National Natural Science Foundation;
Grant numbers: 30025013, 39990600; Grant sponsor: National Basic
Research Projects; Grant number: G1999075605.
J. Zhu’s present address is Department of Biochemistry and Molecular
Biophysics, Columbia University and the Howard Hughes Medical
Institute, Black Building, 650 West, 168 Street, Room 221, New York,
NY 10032. E-mail: jz2106@columbia.edu
*Correspondence to: Haiyan Liu or Yunyu Shi, School of Life
Sciences, University of Science and Technology of China, Hefei, Anhui,
230026, China. E-mail: hyliu@ustc.edu.cn or yyshi@ustc.edu.cn
Received 17 October 2002; Accepted 28 January 2003
PROTEINS: Structure, Function, and Genetics 52:598 – 608 (2003)