Improved PEP-FOLD Approach for Peptide and Miniprotein Structure Prediction

Yimin Shen,§,# Julien Maupetit,§,# Philippe Derreumaux,§ and Pierre Tufféry*,§,#

INSERM U973, MTi, F-75205 Paris, France
Laboratoire de Biochimie Théorique, UPR 9080 CNRS, Institut de Biologie Physico-Chimique, F-75005 Paris, France
Institut Universitaire de France, 103 Boulevard Saint-Michel, 75005, Paris, France
§ Univ Paris Diderot, Sorbonne Paris Cité, F-75205 Paris, France

Supporting Information

ABSTRACT: Peptides and miniproteins have many biological and biomedical implications, which motivates the development of accurate methods, suitable for large-scale experiments, to predict their experimental or native conformations solely from sequence. In this study, we report PEP-FOLD2, an improved coarse-grained approach for de novo peptide structure prediction, and compare it with PEP-FOLD1 and the state-of-the-art Rosetta program. On a benchmark of 56 structurally diverse peptides with 25–52 amino acids and a total of 600 simulations for each system, PEP-FOLD2 generates higher quality models than PEP-FOLD1, and PEP-FOLD2 and Rosetta generate near-native or native models for 95% and 88% of the targets, respectively. When no experimental structure is at hand, PEP-FOLD2 and Rosetta return a near-native or native conformation among the top five best scored models for 80% and 75% of the targets, respectively. While the PEP-FOLD2 prediction rate is better than the Rosetta prediction rate by 5%, this improvement is non-negligible because PEP-FOLD2 explores a larger conformational space than Rosetta and consists of a single coarse-grained phase. Our results indicate that, although the coarse-grained PEP-FOLD2 method is approaching maturity, we are not at the end of the game of miniprotein structure prediction, but this opens new perspectives for large-scale in silico experiments.
Received: April 5, 2014. Published: August 20, 2014. J. Chem. Theory Comput. 2014, 10, 4745–4758. dx.doi.org/10.1021/ct500592m. pubs.acs.org/JCTC. © 2014 American Chemical Society.

INTRODUCTION

Fast and accurate peptide structure characterization remains a long-standing goal in structural biology and peptide engineering, since peptides of up to 50 amino acids represent a source of novel antibiotics and therapeutics. 1 In addition, peptides of this size can fold autonomously and be the functional centers of full-length proteins (e.g., C1, UBA, and WW, to cite some). 2–4 One major obstacle in predicting peptide structures, in contrast to larger proteins, is that only a small number of solution structures have been characterized and are available in structural databases. On October 1st, 2013, the number of entries in the Protein Data Bank (PDB) 5 corresponding to isolated proteins of less than 51 amino acids was 2057, and only 799 of them shared less than 30% sequence identity and had structures not solved in a membrane environment. In addition, de novo sequences can differ from those in the PDB by more than 70% in sequence identity, making comparative modeling techniques unreliable when no experimental information is available. For instance, it is remarkable that the de novo peptides with the helix-turn-helix motif designed in 2004 (PDB 1vrz) and with the beta-alpha-beta motif designed in 2009 (PDB 2ki0) still have no homologue in the PDB.

Considering the number of new sequences delivered by each genome project, we need to go beyond time-consuming simulations of all-atom systems in explicit solvent, even though molecular dynamics studies have succeeded in folding structurally diverse proteins of 10–80 amino acids by using the specially designed Anton computer 6 or the Folding-at-home project. 7 Present estimates of the number of hypothetical peptide coding sequences in the complete prokaryotic genomes available today are on the order of 1.5 million. 57 In eukaryotes, the number of peptide candidates is even higher, with estimates of the number of venom peptides on the order of 12 million. 8 This highlights the need for fast approaches to model the structure of peptides and small proteins.

The most efficient and rapid methods are multiscale in character. Such methods start sampling with low-resolution models, use fragment assembly (FA) methods, and then select some conformations for subsequent full-atom refinements. These include the widely used Rosetta, 9,10 I-Tasser, 11 and Quark 12 methods. Other Web servers include PepStr, 13 Bhageerath, 14 and Peplook. 15 Other programs such as Zipping and Assembly, 16 the AWSEM-based approach, 17 the conformational space annealing, 18 GPS, 19 and replica exchange molecular dynamics simulations (REMD) with OPEP 20 are not open and