PROTEINS: Structure, Function, and Bioinformatics

TASSER_low-zsc: An approach to improve structure prediction using low z-score–ranked templates

Shashi B. Pandit and Jeffrey Skolnick*

Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30318

INTRODUCTION

Despite significant progress, the prediction of protein structure remains an unsolved problem in computational structural biology [1–3]. Historically, structure prediction methods have been divided into three general categories: comparative modeling (CM) [1,4–6], threading [7–11], and free modeling (FM) [12–15]. The basic objective of CM and threading approaches is to identify a set of template proteins (with known tertiary structure) that are structurally related to the target sequence [5,9]. CM methods identify template proteins with a clear evolutionary relationship to the target by using sequence-based methods [5,16], whereas threading, by including protein structural information, strives to identify template proteins that have a fold similar to that of the target irrespective of their evolutionary relationship [3,8,9]. Because of the convergence of threading and CM methods, both are referred to as template-based modeling (TBM) [17].

In TBM, once a related template is identified, the target sequence is aligned to the template structure either indirectly, by performing a sequence alignment and then transferring this alignment to the associated positions in the structure, or directly, by incorporating structural information into the alignment procedure [5,9–11]. A full-length model is then generated by building the chain in the regions not aligned to the template. This full-length structure is then refined, with the goal of improving model quality relative to the initial TBM-provided alignment. In contrast, in FM, one does not use any global template structural information as an input [12,13]. Thus, the possibility of assembling a novel fold exists.
In recent years, TBM has emerged as the most robust approach to protein structure prediction [3,17]. Advances in template identification and alignment accuracy have resulted from the use of profile–profile alignments [18–20], the inclusion of structural properties such as solvent accessibility [21] and structural profiles [8,10,11,22–24], hidden Markov models [25,26], machine-learning approaches [27,28], and the employment of meta-servers [29–31]. Model refinement can be achieved by using multiple templates to generate better alignments [32,33], by iterative refinement [34,35], and by physics-based and evolution-based potentials [12,35,36]. The ultimate success of TBM requires that structures similar to those adopted by the target be found in the Protein Data Bank (PDB) [37]. Recent studies have demonstrated that the current PDB library is most likely complete; hence, it can provide templates for all compact, single-domain proteins from which low-

Additional Supporting Information may be found in the online version of this article.
Grant sponsor: National Institutes of Health; Grant number: GM-48835.
*Correspondence to: Jeffrey Skolnick, Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30318. E-mail: skolnick@gatech.edu.
Received 12 February 2010; Revised 12 May 2010; Accepted 29 May 2010
Published online 10 June 2010 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/prot.22791

ABSTRACT

In a variety of threading methods, poorly ranked (low z-score) templates often have good alignments. Here, a new method, TASSER_low-zsc, that identifies these low z-score–ranked templates to improve protein structure prediction accuracy is described. The approach consists of clustering of threading templates by affinity propagation on the basis of structural similarity (thread_cluster), followed by TASSER modeling, with final models selected by using a TASSER_QA variant.
To establish the generality of the approach, templates provided by two threading methods, SP³ and SPARKS², are examined. The SP³ and SPARKS² benchmark datasets consist of 351 and 357 medium/hard proteins (those with moderate to poor quality templates and/or alignments) of length 250 residues, respectively. For SP³ medium and hard targets, using thread_cluster, the TM-scores of the best template improve by 4 and 9% over the original set (without low z-score templates), respectively; after TASSER modeling/refinement and ranking, the best model improves by 7 and 9% over the best model generated with the original template set. Moreover, TASSER_low-zsc generates 22% (43%) more foldable medium (hard) targets. Similar improvements are observed with low-ranked templates from SPARKS². The template clustering approach could be applied to other modeling methods that utilize multiple templates to improve structure prediction.

Proteins 2010; 78:2769–2780. © 2010 Wiley-Liss, Inc.

Key words: structure prediction; threading; TASSER; tertiary structure.
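The template and model improvements quoted above are measured in TM-score. For reference, a minimal Python sketch of the standard TM-score formula is given below, assuming the aligned residue pairs have already been superimposed and their Cα–Cα distances (in Å) are available. The function names `tm_d0` and `tm_score` are illustrative and do not come from the TASSER software. The sketch evaluates the score for one given superposition, whereas the full TM-score is the maximum over superpositions. The fallback to d0 = 0.5 Å for very short chains is an assumed convention that keeps the formula well defined.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in Angstroms) of the TM-score.

    Assumption: for very short targets, where the formula would give a
    small or undefined value, d0 is set to 0.5 Angstroms.
    """
    if target_length <= 15:
        return 0.5
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return max(d0, 0.5)


def tm_score(distances, target_length):
    """TM-score of one fixed superposition.

    distances: distances (Angstroms) between aligned residue pairs after
               superposition; unaligned target residues contribute zero.
    target_length: number of residues in the target protein.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


if __name__ == "__main__":
    # Hypothetical example: 80 of 100 target residues aligned, each 2 A away.
    print(round(tm_score([2.0] * 80, 100), 3))
```

Because the sum is normalized by the target length rather than by the number of aligned residues, a template that covers only part of the target is penalized for the missing coverage, which is why the score is suited to comparing templates with alignments of different lengths.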