Available online at www.sciencedirect.com

Computers and Chemical Engineering 32 (2008) 78–88

Prediction of secondary structures of proteins using a two-stage method

Fadime Üney Yüksektepe, Özlem Yılmaz, Metin Türkay∗

College of Engineering, Koç University, Rumelifeneri Yolu, Sarıyer, 34450 İstanbul, Turkey

Received 31 October 2006; received in revised form 18 July 2007; accepted 19 July 2007
Available online 25 July 2007

Abstract

Protein structure determination and prediction have been a focal research subject in the life sciences because protein structure is central to understanding the biological and chemical activities of organisms. The experimental methods used to determine the structures of proteins demand sophisticated equipment and time. A host of computational methods have been developed to predict the location of secondary structure elements in proteins, either to complement experimental results or to create insights into them. However, the prediction accuracies of these methods rarely exceed 70%. In this paper, a novel two-stage method is presented that predicts the location of secondary structure elements in a protein using primary structure data only. In the first stage of the proposed method, the folding type of a protein is determined using a novel classification approach for multi-class problems. The second stage of the method utilizes data available in the Protein Data Bank and determines the possible location of secondary structure elements with a probabilistic search algorithm. It is shown that the average accuracy of the predictions is 74.1% on a large structure dataset.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Protein secondary structure prediction; Data classification; Mixed-integer linear programming

1. Introduction

Proteins are large molecules indispensable for the existence and proper functioning of biological organisms. Proteins are used in the structure of cells, which are the main constituents of larger formations such as tissues and organs.
Proteins are also required for proper functioning and regulation of organisms. Understanding the functions of proteins is also a fundamental problem in the discovery of drugs to treat various diseases.

Proteins are polymer chains of repeating polypeptide units with side chains attached to each polypeptide unit. The side chains, also known as residues, are amino acids with different characteristics. There are 20 different amino acids in natural proteins. The sequence of amino acids in a protein chain is given by the primary structure. A typical protein contains 200–300 amino acids, but this may increase up to approximately 30,000 in a single chain. Proteins have three local structural conformations: helices, sheets, and other structural conformations such as loops, turns and coils. Helices are spiral strings formed by hydrogen bonds between CO and NH groups in residues. β-Sheets are plain strands formed by a stretched polypeptide backbone. Connecting structures do not have regular shapes; they connect α-helices and β-strands to each other. Turns enable parts of the polypeptide chain to fold onto itself, altering the direction of the polypeptide chain to form its three-dimensional shape. The secondary structure of proteins is the structural characterization of a protein with respect to these three local structural conformations.

∗ Corresponding author. Tel.: +90 212 338 1586; fax: +90 212 338 1548. E-mail address: mturkay@ku.edu.tr (M. Türkay).

Proteins are classified according to their secondary structure content, considering α-helices and β-strands. Levitt and Chothia (1976) were the first to propose a classification of proteins into four basic types according to their α-helix and β-sheet content: ‘all-α’ proteins consist almost entirely (at least 90%) of α-helices; ‘all-β’ proteins are composed mostly of β-sheets (at least 90%) in their secondary structures; and two intermediate classes (α/β and α + β) have mixed α-helices and β-sheets.
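As a rough illustration of the content-based part of this classification, the sketch below assigns a class from a per-residue secondary-structure string. It rests on several assumptions not stated in the text: a three-state encoding ('H' for helix, 'E' for strand, 'C' for everything else), fractions computed over all residues, and hypothetical function names. Content alone cannot separate the two intermediate classes, so they are reported together.

```python
from collections import Counter
from typing import Tuple

# Assumed three-state encoding (not taken from the paper):
# 'H' = helix, 'E' = strand/sheet, 'C' = loops, turns and coils.
HELIX, STRAND = "H", "E"


def content_fractions(ss: str) -> Tuple[float, float]:
    """Return (helix fraction, strand fraction) over all residues."""
    if not ss:
        raise ValueError("empty secondary-structure string")
    counts = Counter(ss)
    return counts[HELIX] / len(ss), counts[STRAND] / len(ss)


def folding_class(ss: str, threshold: float = 0.90) -> str:
    """Assign a content-based class using the 90% rule quoted in the text.

    The two intermediate classes differ in how helices and strands are
    arranged (parallel vs. antiparallel), which a content count cannot
    see, so they are returned as a single 'mixed' label.
    """
    helix, strand = content_fractions(ss)
    if helix >= threshold:
        return "all-alpha"
    if strand >= threshold:
        return "all-beta"
    return "mixed (alpha/beta or alpha+beta)"
```

For example, `folding_class("H" * 19 + "C")` falls in the all-α class under these assumptions, because 95% of its residues are helical.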
The ‘α/β’ proteins have approximately alternating, mainly parallel segments of α-helices and β-sheets. The last class, ‘α + β’, has a mixture of all-α and all-β regions, mostly in an antiparallel fashion (Mount, 2001).

0098-1354/$ – see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.compchemeng.2007.07.002

Computational approaches to predict protein structures can be very useful in creating insights into protein folding and