Four-Body Contact Potentials Derived From Two Protein Datasets to Discriminate Native Structures From Decoys

Yaping Feng,1,2 Andrzej Kloczkowski,1,2 and Robert L. Jernigan1,2*

1 Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, Iowa 50011-0320
2 L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, Iowa 50011-3020

ABSTRACT
Two-body inter-residue contact potentials for proteins have often been extracted and extensively used for threading. Here, we have developed a new scheme to derive four-body contact potentials as a way to consider protein interactions in a more cooperative model. We use several datasets of protein native structures to demonstrate that around 500 chains are sufficient to provide a good estimate of these four-body contact potentials by obtaining convergent threading results. We also have deliberately chosen two sets of protein native structures differing in resolution, one with all chains' resolution better than 1.5 Å and the other with 94.2% of the structures having a resolution worse than 1.5 Å, to investigate whether potentials from well-refined protein datasets perform better in threading. However, potentials from well-refined proteins did not generate statistically significantly better threading results. Our four-body contact potentials can discriminate well between native structures and partially unfolded or deliberately misfolded structures. Compared with another set of four-body contact potentials derived by using a Delaunay tessellation algorithm, our four-body contact potentials appear to offer a better characterization of the interactions between backbones and side chains and provide better threading results, somewhat complementary to those found using other potentials. Proteins 2007;68:57–66. © 2007 Wiley-Liss, Inc.
Key words: contact potential; two-body potential; four-body potential; Delaunay tessellation (DT)

INTRODUCTION
Prediction of protein three-dimensional structures from amino acid sequences is a well-known goal in computational biology, since the determination of structures by experimental methods, such as NMR spectroscopy and X-ray crystallography, cannot keep pace with the explosion of protein sequence information from genome sequencing efforts, and those experimental structure determinations are costly both in terms of equipment and human effort.1 A variety of different strategies, including homology modeling, molecular dynamics simulations, energy minimization, and native fold recognition (threading), have been pursued as attempted solutions to this problem. Although homology modeling can lead to accurate predictions of protein structure when closely similar sequences exist, it does not provide much insight regarding the principles of protein folding.

Sali et al.2 have suggested that the lack of a suitably reliable potential function, rather than the design of folding algorithms, could be the major bottleneck for structure predictions. Russ and Ranganathan3 indicated that the potential functions currently used in assessing the free energy changes upon folding are not well defined at the physicochemical level and are often unpredictably imprecise for modeling the experimentally observed energetic properties of proteins. Most successful protein structure predictions use statistical contact potentials in their force fields for threading or ab initio protein structure prediction, as shown by analyses of the results of the Critical Assessment of Techniques for Protein Structure Prediction.4,5 However, the development of more effective and more accurate statistical potential functions to describe interactions between residues remains a goal for predicting the three-dimensional structures of proteins from their sequences.
The Supplementary Material referred to in this article can be found at http://www.interscience.wiley.com/jpages/0887-3585/suppmat/
Grant sponsor: NIH; Grant number: R01-GM072014.
*Correspondence to: Robert L. Jernigan, Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA 50011-0320. E-mail: jernigan@iastate.edu
Received 25 July 2006; Revised 19 November 2006; Accepted 27 November 2006
Published online 28 March 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21362
PROTEINS: Structure, Function, and Bioinformatics 68:57–66 (2007)

Because the computational construction of atomic models with huge numbers of degrees of freedom requires enormous computational times, many have advocated the use of coarse-grained models, or low-resolution approaches that significantly reduce the numbers of requisite conformational variables.6 Studies with coarse-grained models have revealed that such simple models can capture important characteristics of the overall folds.7–9 The usual approach has been to coarse-grain with a single point representing each amino acid. Significant efforts have been expended to derive such empirical contact potentials for use in fold recognition. Tanaka and Scheraga10 first introduced pairwise contact potentials to identify protein native conformations. Later Miyazawa and Jernigan11,12 developed a better basis for them by applying the quasi-chemical approximation. The dependence of these pairwise potentials on distance cutoff was thoroughly investi-