Long Loop Prediction Using the Protein Local Optimization Program

Kai Zhu, David L. Pincus, Suwen Zhao, and Richard A. Friesner*

Department of Chemistry, Columbia University, New York, New York

ABSTRACT

We have developed an improved sampling algorithm and energy model for protein loop prediction, the combination of which has yielded the first methodology capable of achieving good results for the prediction of backbone conformations of loops 11 residues or longer. Applied to our newly constructed test suite of 104 loops ranging from 11 to 13 residues, our method obtains average/median global backbone root-mean-square deviations (RMSDs) to the native structure (superimposing the body of the protein, not the loop itself) of 1.00/0.62 Å for 11 residue loops, 1.15/0.60 Å for 12 residue loops, and 1.25/0.76 Å for 13 residue loops. Sampling errors are virtually eliminated, while energy errors leading to large backbone RMSDs are very infrequent compared to any previously reported efforts, including our own previous study. We attribute this success to both an improved sampling algorithm and, more critically, the inclusion of a hydrophobic term, which appears to approximately fix a major flaw in the SGB solvation model that we have been employing. A discussion of these results in the context of the general question of the accuracy of continuum solvation models is presented. Proteins 2006;65:438–452. © 2006 Wiley-Liss, Inc.

Key words: loop prediction; conformational sampling; continuum solvation model; hydrophobic

INTRODUCTION

Loop prediction has become a canonical problem in assessing methods for high-resolution protein structural modeling. Well-defined test cases can be constructed by starting with a high-resolution structure from the Protein Data Bank (PDB), defining a loop region, and predicting the structure in that region while keeping the remaining residues of the protein fixed at their crystallographic coordinates.
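The global backbone RMSD quoted in the abstract can be illustrated with a minimal Python sketch. It assumes that the predicted and native loops are already expressed in the same coordinate frame, which holds by construction when the rest of the protein is kept at its crystallographic coordinates, so the loop itself is never refitted. The function name and the toy coordinates are hypothetical and are not taken from PLOP.

```python
import math


def global_backbone_rmsd(predicted, native):
    """RMSD over matched backbone atoms, with no superposition of the loop.

    Both arguments are lists of (x, y, z) tuples in angstroms, given in the
    same frame (the frame of the fixed protein body).
    """
    if len(predicted) != len(native) or not native:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (px, py, pz), (nx, ny, nz) in zip(predicted, native):
        total += (px - nx) ** 2 + (py - ny) ** 2 + (pz - nz) ** 2
    return math.sqrt(total / len(native))


# Toy example (hypothetical coordinates): the predicted loop has the native
# shape but is shifted rigidly by 1 angstrom along z.
native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0)]
predicted = [(x, y, z + 1.0) for (x, y, z) in native]
rmsd = global_backbone_rmsd(predicted, native)
```

In this toy case the global RMSD is 1 Å, whereas an RMSD computed after superimposing the loop on itself would be zero. This is why a global measure is the stricter test: it penalizes a loop that has the right shape but sits in the wrong place relative to the protein body.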
Realistic applications, such as enumeration of alternative low-energy conformations of the loop (as are frequently seen, for example, in flexible active sites such as those of kinases), or construction of accurate loop conformations in homology modeling, require reprediction of surrounding side chains (and possibly other degrees of freedom) as well as the loop itself. Thus, success in repredicting native loops in the fixed, crystallographically determined environment is necessary, but not sufficient, to enable useful practical deployment of the methodology.

In previous work (Ref. 1), we introduced a new approach to loop prediction, in which rigorous hierarchical sampling algorithms are combined with a high-quality molecular mechanics force field and continuum solvation model. These methods have been implemented in the Protein Local Optimization Program (PLOP) and were tested on a suite of 800 loops ranging in length from 4 to 12 residues. Qualitatively improved accuracy was obtained compared to previous efforts at loop prediction, which principally have employed approximate, knowledge-based potential energy functions, as opposed to a model based on an atomic level description of the physical chemistry.

Although the results in Ref. 1 were encouraging, the performance of the method clearly deteriorated beyond a loop length of 9 residues. Both sampling errors (i.e., cases where the total energy of the predicted structure was significantly higher than that of the minimized or side-chain-optimized native structure) and energy errors (cases where the total energy of the predicted structure was significantly lower than that of the native structure) increased in frequency compared to shorter loops, and the RMSDs from the native loop of both the sampling and energy errors increased in magnitude. Furthermore, the test suites used for assessing performance on longer loops were inadequate in size.

The problems observed in Ref.
1 for long loop prediction are far from unique to that article. Table I (Refs. 2–5) presents results taken from work by various groups in predicting loops of length 11 or greater. All these approaches use dihedral angle buildup and candidate selection by a scoring or energy function, but they differ in the details of the algorithm and in the composition of the energy function. A recent article by Monnigmann et al. (Ref. 6) provides an overview and brief descriptions of the various alternative methods. It should be noted that the results in Table I were not generated on the same test set; however, they do show a common trend. There is a transition of some sort between 10 and 12 residues, which renders the loop prediction problem quali-

The Supplementary Material referred to in this article can be found at http://www.interscience.wiley.com/jpages/0887-3585/suppmat/
The first two authors contributed equally to this work.
Grant sponsor: NIH; Grant number: GM 52018 (to R.A.F.).
*Correspondence to: Richard A. Friesner, Department of Chemistry, Columbia University, New York, NY 10027. E-mail: rich@chem.columbia.edu
Received 3 October 2005; Revised 27 January 2006; Accepted 12 March 2006
Published online 22 August 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21040
PROTEINS: Structure, Function, and Bioinformatics 65:438–452 (2006) © 2006 WILEY-LISS, INC.