PROTEINS: Structure, Function, and Bioinformatics

Prediction of protein secondary structure content for the twilight zone sequences

Leila Homaeian,1 Lukasz A. Kurgan,1* Jishou Ruan,2 Krzysztof J. Cios,3 and Ke Chen1

1 Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada
2 Chern Institute of Mathematics, College of Mathematical Science and LPMC, Nankai University, Tianjin, People’s Republic of China
3 Department of Computer Science and Engineering, University of Colorado at Denver and Health Sciences Center, Denver, Colorado

INTRODUCTION

Prediction of the secondary structure content, defined as the percentage of helices and strands in a protein, provides useful information for characterizing the overall protein structure. The dictionary of secondary structures of proteins (DSSP) annotates each amino acid (AA) as belonging to one of eight secondary structure types [1]: H (helix), G (3₁₀-helix), I (pi-helix), B (residue in isolated bridge), E (extended strand), T (hydrogen-bonded turn), S (bend), and ‘‘_’’ (any other structure). Typically, these eight secondary structure types are reduced to just three groups [2]: helix (which includes types H, G, and I), strand (which includes types E and B), and coil (which includes T, S, and the others). Although secondary structure content prediction can be performed for the eight-state representation [3–6], the majority of prior attempts, including this work, address the three-state problem.

The first secondary structure content prediction effort was undertaken in the early 1970s, when a multiple linear regression (MLR) model was used to predict the content from a composition vector-based sequence representation for a small set of 18 proteins [7]. It was not until the 1990s that another content prediction approach was proposed.
[8] The authors used the composition vector, the molecular weight of the sequence, and the presence or absence of a bound heme group to represent protein sequences, and two neural networks to perform the prediction. Still another method used the composition vector representation and an analytic vector decomposition technique to predict the content [9]. In the late 1990s, an MLR model was applied to a sequence representation that, for the first time, used auto-correlation functions based on hydrophobicity [10]. Similar methods, which

ABSTRACT

Secondary protein structure carries information about local structural arrangements, which include three major conformations: α-helices, β-strands, and coils. A significant majority of successful methods for secondary structure prediction are based on multiple sequence alignment. However, multiple alignment fails to provide accurate results when a sequence comes from the twilight zone, that is, when it is characterized by low (<30%) homology. To this end, we propose a novel method for prediction of secondary structure content through comprehensive sequence representation, called PSSC-core. The method uses a multiple linear regression model and introduces a comprehensive feature-based sequence representation to predict the amount of helices and strands for sequences from the twilight zone. The PSSC-core method was tested and compared with two other state-of-the-art prediction methods on a set of 2187 twilight zone sequences. The results indicate that our method provides better predictions for both helix and strand content. PSSC-core is shown to provide statistically significantly better results than the competing methods, reducing the prediction error by 5–7% for helix and 7–9% for strand content predictions. The proposed feature-based sequence representation uses a comprehensive set of physicochemical properties that are custom-designed for each of the helix and strand content predictions.
It includes composition and composition moment vectors, frequency of tetra-peptides associated with helical and strand conformations, various property-based groups like exchange groups, chemical groups of the side chains and hydrophobic group, auto-correlations based on hydrophobicity, side-chain masses, hydropathy, and conformational patterns for β-sheets. The PSSC-core method provides an alternative for predicting the secondary structure content that can be used to validate and constrain results of other structure prediction methods. At the same time, it also provides useful insight into design of successful protein sequence representations that can be used in developing new methods related to prediction of different aspects of the secondary protein structure.

Proteins 2007; 69:486–498. © 2007 Wiley-Liss, Inc.

Key words: protein structure; secondary protein structure; secondary structure content; twilight zone; low sequence homology.

Grant sponsors: NSERC, MITACS, Liuhui Center for Applied Mathematics, NSFC, Butcher Foundation.
*Correspondence to: Lukasz A. Kurgan, Department of Electrical and Computer Engineering, 2nd floor, ECERF (9107 116 Street), University of Alberta, Edmonton, AB, Canada T6G 2V4. E-mail: lkurgan@ece.ualberta.ca
Received 13 November 2006; Revised 15 February 2007; Accepted 22 March 2007
Published online 10 July 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21527
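
As an illustration of the definitions given in the Introduction, the following minimal Python sketch reduces a DSSP eight-state annotation to the three groups (helix, strand, coil), computes the helix and strand content as percentages, and builds a composition vector. It is not the PSSC-core implementation. The function names, the example strings, and the treatment of any unlisted DSSP symbol as coil are assumptions made for illustration, and the composition vector is assumed to follow its usual definition as the relative frequency of each of the 20 standard amino acids.

```python
from collections import Counter

# DSSP codes grouped as described in the Introduction:
# helix = H, G, I; strand = E, B; everything else (T, S, "_") = coil.
HELIX_CODES = frozenset("HGI")
STRAND_CODES = frozenset("EB")

# The 20 standard amino acids, in alphabetical order of one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def reduce_to_three_states(dssp):
    """Map a DSSP eight-state string to H (helix), E (strand), C (coil).

    Assumption: any symbol outside the helix and strand groups is coil.
    """
    reduced = []
    for code in dssp:
        if code in HELIX_CODES:
            reduced.append("H")
        elif code in STRAND_CODES:
            reduced.append("E")
        else:
            reduced.append("C")
    return "".join(reduced)


def secondary_structure_content(dssp):
    """Return (helix %, strand %) for a DSSP annotation string."""
    if not dssp:
        raise ValueError("empty DSSP annotation")
    states = reduce_to_three_states(dssp)
    n = len(states)
    helix = 100.0 * states.count("H") / n
    strand = 100.0 * states.count("E") / n
    return helix, strand


def composition_vector(sequence):
    """Return the relative frequency of each standard amino acid.

    Residues outside the 20 standard codes count toward the sequence
    length but have no entry of their own (an illustrative choice).
    """
    if not sequence:
        raise ValueError("empty sequence")
    counts = Counter(sequence.upper())
    n = len(sequence)
    return [counts[aa] / n for aa in AMINO_ACIDS]


if __name__ == "__main__":
    example_dssp = "HHHHGGTTEEEEB_SS"  # hypothetical annotation
    print(reduce_to_three_states(example_dssp))
    print(secondary_structure_content(example_dssp))
    print(composition_vector("AACD"))  # hypothetical sequence
```

In the hypothetical annotation above, 6 of the 16 residues fall in the helix group and 5 in the strand group, so the helix and strand contents are 37.5% and 31.25%. A composition vector of this kind is the simplest of the sequence representations mentioned in the text; the content values are the targets that a regression model would be trained to predict from it.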