Current Protein and Peptide Science, 2009, 10, 270-285
1389-2037/09 $55.00+.00 © 2009 Bentham Science Publishers Ltd.

A Guide to Template Based Structure Prediction

Xiaotao Qu¹, Rosemarie Swanson², Ryan Day³ and Jerry Tsai³,*

¹Moffitt Cancer Center, 12902 Magnolia Drive, Tampa, FL 33612, USA; ²Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX 77843, USA; ³Department of Chemistry, University of the Pacific, 3601 Pacific Avenue, Stockton, CA 95211, USA

Abstract: Template based protein structure prediction (commonly referred to as homology or comparative modeling) uses knowledge of solved structures to model a protein sequence’s native or true fold. First, a parent structure is found and then a template structure is built by mapping the target sequence onto the parent structure. This putative structure is refined using a combination of backbone moves, side-chain packing, and loop modeling. Template based protein structure prediction has always held great promise to produce atomically accurate models close to the native conformation based on two major assumptions. First, similar sequences exhibit similar protein folds. Second, soluble proteins populate a discrete fold space with many representatives already solved in our Protein Data Bank (PDB). Ironically, beginning so close to the native structure is also the primary source of problems confronting this method and is the reason for the lack of progress in this category of structure prediction. In this review, the general concepts and procedures for template based structure prediction are outlined based on the following topics: sequence alignment, parent structure selection, template structure building, refinement, evaluation, and final structure selection. Then, a description of established software and algorithms is provided where the advantages and limitations of the different methods will be pointed out.
This is followed by a discussion of the developments in template based structure prediction up to the 7th Critical Assessment of Structure Prediction meeting. Lastly, we will address the increased difficulty in improving templates that start so close to the native structure, and discuss the improvements needed in this field.

Keywords: Template based modeling/prediction (TBM), structure prediction, side-chain packing, structure refinement, loop modeling, multiple sequence alignment, model evaluation, structure selection.

INTRODUCTION

While commonly known as homology modeling and more recently, comparative modeling [1-3], the method of creating a prediction of an unknown structure using a close structural homolog is better described as template based modeling/prediction (TBM) of protein structure (Fig. 1). This is now the accepted terminology in the protein structure prediction community. The new designation is more general and allows for the distinct contrast to template free structure prediction, more commonly known as ab initio or de novo modeling [4-6]. Because it is believed that a representative of every protein fold will eventually be solved [7-13], template based structure prediction holds a great deal of promise for the field of protein structure modeling. The availability of a representative fold as a starting template for a sequence of unknown structure offers the quickest path to generating a model of the real structure. Furthermore, template based methods produce the most reliable and accurate predictions of protein structure aside from experimental determination [14, 15]. Unfortunately, the imprecise variations between the close template and the real structure produce the major source of challenges facing this field today. In fact, template based structure prediction has been trying to overcome these obstacles since its inception.
In the following discussion, the primary problems inherent to starting with inexact templates will be explained in more detail. For consistency and clarity, we will adhere to the following terminology throughout this review, as shown in Fig. (1). “Native” refers to the experimentally determined structure of the target sequence. “Target” refers to the protein being predicted/modeled. “Parent” refers to an initial known protein structure that is used to create the starting “template” structure. Finally, a “model” structure is any prediction of the native structure; however, it usually indicates structures refined from the starting template.

*Address correspondence to this author at the Department of Chemistry, University of the Pacific, 3601 Pacific Avenue, Stockton, CA 95211, USA; Tel: (209) 946-2298; Fax: (209) 946-2607; E-mail: jtsai@pacific.edu

Although the delineation between steps is somewhat arbitrary since many methods combine steps, we have organized our discussion of TBM into the four steps [16-18] outlined in Fig. (1). First, the parent structure(s) are identified using sequence searches against the known structure database (the Protein Data Bank [19]). Second, the initial template structure(s) are constructed by aligning the target sequence to the parent structure and by identifying conserved and variable regions. Third, the structure(s) are refined through a combination of backbone moves, side-chain packing, and loop modeling of highly variable regions. This step is an attempt to sample the conformational space of the native structure, and usually a large number (on the order of thousands) of potential models are created. The last step is therefore to evaluate these models and choose the one that is nearest in structure to the native. In many methods described below, the first and second steps occur concurrently, and the same can be said for the third and fourth steps.
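The four steps can be sketched in a few lines of Python. This is a toy illustration only and not any published method: the `Structure` class, all function names, the ungapped identity measure, the random coordinate moves, and the scoring function are assumptions introduced here for clarity. Real methods use gapped alignments, physically motivated moves, and far richer energy or knowledge-based functions. Because the native structure is unknown at prediction time, the final selection relies on a surrogate score; the sketch uses the deviation of consecutive Cα-Cα distances from the typical value of about 3.8 Å.

```python
import random
from dataclasses import dataclass


@dataclass
class Structure:
    """Hypothetical minimal structure: one (x, y, z) C-alpha position per residue."""
    name: str
    sequence: str
    coords: list


def identity(a, b):
    """Fraction of identical positions over the shorter of two ungapped sequences."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def find_parent(target_seq, database):
    """Step 1: choose the known structure whose sequence best matches the target."""
    return max(database, key=lambda s: identity(target_seq, s.sequence))


def build_template(target_seq, parent):
    """Step 2: map the target sequence onto the parent coordinates (no gaps here)."""
    n = min(len(target_seq), len(parent.coords))
    return Structure("template", target_seq[:n], list(parent.coords[:n]))


def refine(template, n_models, rng, step=0.5):
    """Step 3: sample candidate models by small random moves of every position."""
    models = []
    for i in range(n_models):
        coords = [tuple(c + rng.uniform(-step, step) for c in xyz)
                  for xyz in template.coords]
        models.append(Structure(f"model_{i}", template.sequence, coords))
    return models


def score(model, ideal=3.8):
    """Toy surrogate score: squared deviation of consecutive C-alpha distances
    from the ideal value (lower is better)."""
    total = 0.0
    for p, q in zip(model.coords, model.coords[1:]):
        d = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
        total += (d - ideal) ** 2
    return total


def select_model(models):
    """Step 4: evaluate all candidates and keep the best-scoring one."""
    return min(models, key=score)


def predict(target_seq, database, n_models=100, seed=0):
    """Run the four steps in order; returns the chosen parent and final model."""
    rng = random.Random(seed)
    parent = find_parent(target_seq, database)
    template = build_template(target_seq, parent)
    models = refine(template, n_models, rng)
    return parent, select_model(models)
```

Calling `predict(target_seq, database)` with a list of `Structure` objects returns the selected parent and the best-scoring model. Keeping the steps as separate functions mirrors the organization used in this review, although, as noted, many methods merge steps one and two or steps three and four.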
Newer approaches include de novo prediction of variable regions during refinement as well as