Computers and Chemical Engineering 32 (2008) 78–88
Prediction of secondary structures of proteins
using a two-stage method
Fadime Üney Yüksektepe, Özlem Yılmaz, Metin Türkay ∗
College of Engineering, Koç University, Rumelifeneri Yolu, Sarıyer, 34450 İstanbul, Turkey
Received 31 October 2006; received in revised form 18 July 2007; accepted 19 July 2007
Available online 25 July 2007
Abstract
Protein structure determination and prediction have been a focal research subject in the life sciences because protein structure is central to
understanding the biological and chemical activities of organisms. The experimental methods used to determine the structures of proteins demand
sophisticated equipment and time. A host of computational methods have been developed to predict the location of secondary structure elements in
proteins, either to complement experimental results or to provide insight into them. However, the prediction accuracies of these methods rarely exceed 70%. In
this paper, a novel two-stage method is presented that predicts the location of secondary structure elements in a protein using only the primary structure
data. In the first stage of the proposed method, the folding type of a protein is determined using a novel classification approach for multi-class
problems. The second stage of the method utilizes data available in the Protein Data Bank and determines the possible location of secondary
structure elements using a probabilistic search algorithm. It is shown that the average accuracy of the predictions is 74.1% on a large structure dataset.
© 2007 Elsevier Ltd. All rights reserved.
Keywords: Protein secondary structure prediction; Data classification; Mixed-integer linear programming
1. Introduction
Proteins are large molecules indispensable for the existence
and proper functioning of biological organisms. Proteins are
used in the structure of cells, which are the main constituents of larger
formations such as tissues and organs. Proteins are also required for
the proper functioning and regulation of organisms. Understanding
the functions of proteins is also a fundamental problem in the
discovery of drugs to treat various diseases.
Proteins are polymer chains of repeating polypeptide units
with side chains attached to each polypeptide unit. The side
chains, also known as residues, are amino acids with different
characteristics. There are 20 different amino acids in natural
proteins. The sequence of amino acids in a protein chain is given
by the primary structure. A typical protein contains 200–300
amino acids, but this may increase up to approximately 30,000
in a single chain. Proteins have three local structural
conformations: helices, sheets and other structural conformations
such as loops, turns and coils. Helices are spiral strings formed
by hydrogen bonds between CO and NH groups in residues.
β-Sheets are plain strands formed by a stretched polypeptide
backbone. Connecting structures do not have regular shapes; they
connect α-helices and β-strands to each other. Turns enable parts
of the polypeptide chain to fold onto itself, altering the direction
of the polypeptide chain to form its three-dimensional shape.
The secondary structure of proteins is the structural
characterization of a protein with respect to these three local structural
conformations.
∗ Corresponding author. Tel.: +90 212 338 1586; fax: +90 212 338 1548.
E-mail address: mturkay@ku.edu.tr (M. Türkay).
Proteins are classified according to their secondary structure
content, considering α-helices and β-strands. Levitt and Chothia
(1976) were the first to propose a classification of proteins into
four basic types according to their α-helix and β-sheet
content: ‘all-α’ proteins consist almost entirely (at least 90%) of
α-helices; ‘all-β’ proteins are composed mostly of β-sheets (at
least 90%) in their secondary structures; and two intermediate
classes (α/β and α + β) have mixed α-helices and
β-sheets. The ‘α/β’ proteins have approximately alternating,
mainly parallel segments of α-helices and β-sheets. The last
class, ‘α + β’, has a mixture of all-α and all-β regions, mostly in
an antiparallel fashion (Mount, 2001).
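
The content-based part of this classification can be illustrated with a minimal Python sketch. It is not the classification approach proposed in this paper. It assumes a three-state secondary structure string with the hypothetical letter codes H (helix), E (sheet) and C (other), and it assumes that the 90% threshold applies to the share of helix or sheet residues among all residues assigned to a helix or a sheet. The function names are illustrative. Because the α/β and α + β classes differ in how segments are arranged (parallel versus antiparallel) rather than in content alone, the sketch reports them together as "mixed".

```python
from collections import Counter


def helix_sheet_shares(ss: str) -> tuple[float, float]:
    """Return (helix share, sheet share) among helix and sheet residues.

    `ss` is a three-state string over the assumed alphabet H/E/C.
    Raises ValueError for unknown letters or when no residue is H or E.
    """
    counts = Counter(ss)
    unknown = set(counts) - set("HEC")
    if unknown:
        raise ValueError(f"unexpected state codes: {sorted(unknown)}")
    regular = counts["H"] + counts["E"]
    if regular == 0:
        raise ValueError("no helix or sheet residues to classify")
    return counts["H"] / regular, counts["E"] / regular


def folding_type(ss: str, threshold: float = 0.90) -> str:
    """Assign a coarse folding type from secondary structure content."""
    helix, sheet = helix_sheet_shares(ss)
    if helix >= threshold:
        return "all-alpha"
    if sheet >= threshold:
        return "all-beta"
    # Content alone cannot separate alpha/beta from alpha+beta.
    return "mixed"


if __name__ == "__main__":
    for example in ("CHHHHHHHHHHC", "CEEEECCEEEEC", "HHHHCCEEEE"):
        print(example, folding_type(example))
```

Separating the two mixed classes would require information about segment order and strand orientation, which a content count does not capture.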
Computational approaches to predict protein structures can
be very useful in creating insights into protein folding and
doi:10.1016/j.compchemeng.2007.07.002