On the use of secondary structure in protein structure prediction: a bioinformatic analysis

Armando D. Solis, S. Rackovsky*

Department of Biomathematical Sciences, Box 1023, Mount Sinai Medical Center, One Gustave L. Levy Place, New York, NY 10029, USA

Received 10 May 2003; received in revised form 19 August 2003; accepted 19 August 2003

Abstract

The amount of structural information encoded in secondary structure can be measured by its ability to specify the correct peptide backbone conformation of protein chains. Using methodology derived from information theory, we generate optimized distributions of backbone phi–psi dihedral angle pairs given either correct or predicted three-state secondary structure. Entropy measurements on these distributions provide a means to determine the effect of secondary structure knowledge on identifying the actual 3D conformation of protein chains. We find that only a modest fraction of the total uncertainty in phi–psi conformation (from 14 to 38%, at 20° to 90° resolutions, respectively) is resolved even with perfect knowledge of secondary structure. We further show that prediction of secondary structures, because of an accuracy ceiling below 80%, degrades structural information substantially. If prediction accuracy is below 50%, virtually no advantage is gained from using the prediction. Moreover, even state-of-the-art prediction accuracy of 75% retains less than one-third of the structural information encoded in secondary structure. We demonstrate that the level of structural description affects the amount of information extracted. The effort to provide as much structural detail as possible, while faced with a limited structural data set, results in an optimum resolution in the vicinity of a 20° partition of the (φ, ψ) plane.
We show that structural information increases exponentially with prediction accuracy, revealing that even marginal gains in the performance of secondary structure prediction algorithms are important for the retention of structural information. We observe that different kinds of secondary structure prediction outputs (single-state prediction, single-state prediction with a confidence index, and three-state probability prediction) do not differ greatly in the amount of structural information they yield, so long as the methods formulated in this work to generate propensity distributions are applied appropriately. The optimal phi–psi probability distributions developed here may be useful in biasing searches in structure space. We discuss the sources of the degradation of information caused by errors in secondary structure prediction, and their consequences for the prediction of the 3D conformation of protein chains.

© 2003 Elsevier Ltd. All rights reserved.

Keywords: Protein bioinformatics; Information theory; Secondary structure prediction

1. Introduction

The quality and resolution of structural features and patterns detected by statistical analysis of protein backbone chains depend on the descriptor used to specify structural data [1,2]. The most common backbone structural description, the assignment of each residue to one of the three types of secondary structure (2°), provides the simplest means to identify repeating backbone patterns [3]. However, since only three states (helix, extended, and coil) are used in this classification, potentially informative details of the local sequence dependence of backbone conformation are not efficiently recognized. Nonetheless, the 2° prediction problem has become a benchmark in the field of protein structure prediction, and has prompted the development of numerous prediction algorithms over the past two decades [4,5].
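The entropy measurement summarized in the abstract can be illustrated with a minimal sketch. It is not the optimized-distribution procedure developed in this work: it assumes plain Shannon entropies estimated from raw counts, a square partition of the (φ, ψ) plane whose cell width divides 360°, and hypothetical names (`cell_of`, `entropy_bits`, `fraction_resolved`) chosen for illustration only. The quantity returned is the fraction of phi–psi uncertainty removed once the three-state label of each residue is known.

```python
import math
from collections import Counter


def cell_of(phi, psi, resolution):
    """Map a (phi, psi) pair in degrees to a cell of a square partition.

    Assumption: `resolution` (cell width in degrees) divides 360 evenly.
    """
    i = int(((phi + 180.0) % 360.0) // resolution)
    j = int(((psi + 180.0) % 360.0) // resolution)
    return (i, j)


def entropy_bits(counts):
    """Shannon entropy (in bits) of the empirical distribution in `counts`."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def fraction_resolved(residues, resolution):
    """Fraction of phi-psi cell entropy removed by knowing the state label.

    `residues` is an iterable of (phi, psi, state) tuples, where `state`
    is any hashable label such as 'H', 'E' or 'C'.
    """
    cells = Counter()
    cells_by_state = {}
    for phi, psi, state in residues:
        cell = cell_of(phi, psi, resolution)
        cells[cell] += 1
        cells_by_state.setdefault(state, Counter())[cell] += 1

    total = sum(cells.values())
    h_total = entropy_bits(cells)
    # Conditional entropy: per-state entropies weighted by state frequency.
    h_given_state = sum(
        (sum(c.values()) / total) * entropy_bits(c)
        for c in cells_by_state.values()
    )
    return (h_total - h_given_state) / h_total if h_total > 0 else 0.0
```

Raw-count estimates of this kind become unreliable as the partition gets finer and cells become sparsely populated, which is the data-limitation trade-off the abstract refers to when it reports an optimum resolution.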
* Corresponding author. Tel.: +1-212-241-5851; fax: +1-212-860-4630.
E-mail addresses: shelly@camelot.mssm.edu (S. Rackovsky), armando.solis@mssm.edu (A.D. Solis).
0032-3861/$ - see front matter © 2003 Elsevier Ltd. All rights reserved. doi:10.1016/j.polymer.2003.10.065
Polymer 45 (2004) 525–546, www.elsevier.com/locate/polymer

Higher resolution structural descriptors, such as the phi–psi dihedral angle description [1,6] and the Cα trace [2,7–9], are more successful in cataloguing nuances of the local sequence–structure relationship. The goal of protein structure prediction is to assign the correct 3D conformation to a given amino acid sequence [10]. Because even complete knowledge of the secondary structure of a protein is not sufficient to identify its folded structure, 2° prediction schemes are only an intermediate step. Recent work has aimed to close the secondary–tertiary structure gap via homology modelling and other means [11–13]. As they assume the membership of all patterns of