How Well Can We Predict Native Contacts in Proteins Based on Decoy Structures and Their Energies?

Jiang Zhu, Qianqian Zhu, Yunyu Shi,* and Haiyan Liu*

Key Laboratory of Structural Biology, University of Science and Technology of China, Chinese Academy of Sciences, School of Life Sciences, Hefei, Anhui, 230026, China

ABSTRACT One strategy for ab initio protein structure prediction is to generate a large number of possible structures (decoys) and select the most fitting ones based on a scoring or free energy function. The conformational space of a protein is huge, and chances are small that any heuristically generated structure will fall directly in the neighborhood of the native structure. It is desirable that, instead of being thrown away, the unfitting decoy structures can provide insights into native structures so that prediction can be made progressively. First, we demonstrate that a recently parameterized physics-based effective free energy function based on the GROMOS96 force field and a generalized Born/surface area solvent model is, like several other physics-based and knowledge-based models, capable of distinguishing native structures from decoy structures for a number of widely used decoy databases. Second, we observe a substantial increase in the correlation of the effective free energies with the degree of similarity between the decoys and the native structure if the similarity is measured by the content of native inter-residue contacts in a decoy structure rather than by its root-mean-square deviation from the native structure. Finally, we investigate the possibility of predicting native contacts based on the frequency of occurrence of contacts in decoy structures. For most proteins contained in the decoy databases, a meaningful number of native contacts can be predicted based on plain frequencies of occurrence at a relatively high level of accuracy.
Relative to using plain frequencies, overwhelming improvements in the sensitivity of the predictions are observed for the 4_state_reduced decoy sets when energy-dependent weighting of decoy structures is applied in determining the frequency. There, approximately 80% of native contacts can be predicted at an accuracy of approximately 80% using energy-weighted frequencies. The sensitivity of the plain frequency approach is much lower (20% to 40%). Such improvements are, however, not observed for the other decoy databases. The rationalization and implications of the results are discussed. Proteins 2003;52:598–608. © 2003 Wiley-Liss, Inc.

Key words: decoy discrimination; generalized Born model; solvent-accessible surface area; protein structure prediction

Grant sponsor: Chinese National Natural Science Foundation; Grant numbers: 30025013, 39990600; Grant sponsor: National Basic Research Projects; Grant number: G1999075605.

J. Zhu's present address is Department of Biochemistry and Molecular Biophysics, Columbia University and the Howard Hughes Medical Institute, Black Building, 650 West, 168 Street, Room 221, New York, NY 10032. E-mail: jz2106@columbia.edu

*Correspondence to: Haiyan Liu or Yunyu Shi, School of Life Sciences, University of Science and Technology of China, Hefei, Anhui, 230026, China. E-mail: hyliu@ustc.edu.cn or yyshi@ustc.edu.cn

Received 17 October 2002; Accepted 28 January 2003

PROTEINS: Structure, Function, and Genetics 52:598–608 (2003) © 2003 WILEY-LISS, INC.

INTRODUCTION

Understanding the relationship between the sequence of a protein and its unique three-dimensional structure is a subject that has intrigued scientists for decades.¹ In theory, genome projects will result in known sequences for almost all proteins. The need to discover their structures and functions highlights the importance of protein structure prediction. In one possible scenario, a large number of candidate structures are generated for a peptide sequence and evaluated with an energy function that can distinguish the native structure from the misfolded ones (decoys). Toward the aim of developing such energy functions, two different types of approaches are currently under investigation. The first class, the so-called "knowledge-based potentials," is derived from databases of known protein structures and usually represents interactions within proteins at a low level of resolution. Many efficient potentials of this category have been widely used in comparative modeling and fold recognition, and have been extensively reviewed.²⁻⁵

The second class is the so-called "physics-based potentials."⁶⁻¹⁰ Until recently, physics-based all-atom molecular mechanics energy functions were not as commonly used to distinguish protein folds as the statistical models, because their application requires structure optimization at the atomic level, which is relatively expensive in terms of computational cost. Compared with the statistical potentials, however, the physics-based models have some significant advantages. Specifically, they may possess more general applicability than the statistical models, which can be biased by the databases from which they are derived. This is especially relevant in discriminating the native fold from decoys, because many statistical models have been developed using only knowledge of the native structures or folds, and one should not expect such models to produce reasonable energies or scores for non-native-like decoy structures.
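As an illustration of the frequency-based contact prediction summarized in the Abstract, the following is a minimal sketch and not the authors' implementation. It assumes that each decoy has already been reduced to a set of residue-pair contacts, and it compares plain frequencies of occurrence with energy-weighted ones. The function names, the Boltzmann-like weight exp(-(E - E_min)/T), the 0.5 frequency threshold, and the toy data are all hypothetical choices made for the example; the Abstract states only that the weighting is energy dependent. Accuracy is taken here as the fraction of predicted contacts that are native, and sensitivity as the fraction of native contacts that are predicted.

```python
import math


def contact_frequencies(decoy_contacts, energies=None, temperature=1.0):
    """Return {contact: frequency} over a set of decoys.

    decoy_contacts: list of sets of (i, j) residue pairs, one set per decoy.
    energies: optional list of effective free energies, one per decoy.
    With energies=None every decoy counts equally (plain frequency).
    Otherwise each decoy is weighted by exp(-(E - E_min) / temperature);
    this weighting form is an assumption made for illustration only.
    """
    if energies is None:
        weights = [1.0] * len(decoy_contacts)
    else:
        e_min = min(energies)  # shift by the minimum for numerical stability
        weights = [math.exp(-(e - e_min) / temperature) for e in energies]
    total = sum(weights)
    freq = {}
    for contacts, w in zip(decoy_contacts, weights):
        for c in contacts:
            freq[c] = freq.get(c, 0.0) + w / total
    return freq


def predict_contacts(freq, threshold=0.5):
    """Predict as native every contact whose frequency reaches the threshold."""
    return {c for c, f in freq.items() if f >= threshold}


def accuracy_and_sensitivity(predicted, native):
    """Accuracy: share of predictions that are native.
    Sensitivity: share of native contacts that were predicted."""
    true_pos = len(predicted & native)
    accuracy = true_pos / len(predicted) if predicted else 0.0
    sensitivity = true_pos / len(native) if native else 0.0
    return accuracy, sensitivity


# Toy data (hypothetical): one native-like, low-energy decoy among four.
native = {(1, 5), (2, 6), (3, 7)}
decoys = [
    {(1, 5), (2, 6), (3, 7)},
    {(1, 5), (4, 8)},
    {(4, 8), (2, 9)},
    {(4, 8), (2, 9)},
]
energies = [-10.0, -2.0, -1.0, 0.0]

plain = predict_contacts(contact_frequencies(decoys))
weighted = predict_contacts(contact_frequencies(decoys, energies))
print("plain:   ", accuracy_and_sensitivity(plain, native))
print("weighted:", accuracy_and_sensitivity(weighted, native))
```

In this toy case the native-like decoy is a minority, so plain frequencies favor the non-native contacts shared by the other decoys, whereas the weighting lets the low-energy decoy dominate. Whether such a gain appears for real decoy sets depends on how well the energy function ranks native-like structures, which is consistent with the Abstract's observation that the improvement is seen for some decoy databases and not others.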