Combining segmental semi-Markov models with neural networks for protein secondary structure prediction

Niranjan P. Bidargaddi, Madhu Chetty*, Joarder Kamruzzaman
Gippsland School of Information Technology, Monash University, VIC 3842, Australia

Article history: Received 3 October 2006; received in revised form 16 April 2009; accepted 21 April 2009; communicated by M.-J. Er; available online 18 May 2009.

Keywords: Secondary structure; Single sequence; Neural network; Semi-hidden Markov model; Graphical models; Prediction; Hybridization

Abstract

Motivation: Predicting the secondary structure of proteins from a primary sequence alone has been variously approached from either a classification or a generative-model perspective. The most prominent classification methods use neural networks, which map a local window of residues in the sequence to the structural state of the central residue in the window, thus capturing local interactions effectively. However, they fail to capture distant interactions among residues. The generative models based on Bayesian segmentation capture sequence–structure relationships using generalized hidden Markov models with explicit state duration; they capture non-local interactions through a joint sequence–structure probability distribution based on structural segments. In this paper, we investigate a combined architecture, with Bayesian segmentation at the first stage and a neural network at the second stage, which captures both local and non-local correlations to increase single-sequence prediction accuracy. The combined architecture is further enhanced by neural network optimization and ensemble techniques.

Results: The proposed architecture has been built and tested on two widely studied databases comprising 480 and 608 protein sequences, respectively.
It achieved accuracies above 71%, which is comparable to the highest accuracies reported so far for single-sequence methods, without using the evolutionary information provided by multiple sequence alignments. The required data sets and program codes are available at http://www.gippsland.monash.edu.au/research/publish/neurocomputing.zip.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Structural information of proteins provides detailed insight into protein–protein interactions and their functionality. Accurate prediction of protein structure from the primary sequence alone is essential, as it is well known that the information necessary for protein folding resides completely within the primary structure [1]. Predicting the secondary structure (SS) of a protein from its primary sequence is an important step in determining the three-dimensional structure.

SS predictions are compared with secondary structure definitions from known structures. DSSP [2], DEFINE [3] and STRIDE [4] are a few of the algorithms used to assign SS from crystallographically determined coordinates. The most widely used assignment algorithm, DSSP, defines eight SS states denoted by single-letter codes: h (α-helix), t (β-turn), s (bend), i (π-helix), g (3₁₀-helix), e (β-strand), b (β-bridge) and c (others). The eight structural states are translated into three SS states: (1) α-helix (H), corresponding to g and h in the DSSP code [2], (2) β-sheet (E), corresponding to e and b, and (3) coil (C), corresponding to all the others in the DSSP definitions. α-helices are formed by backbone hydrogen bonds linking residues i and i + 4. In β-sheets, hydrogen bonds link two sequence segments in parallel or antiparallel fashion.

The SS is characterized by non-local interactions, position-specific preferences, and correlations among the neighboring amino acid residues of the primary sequence [5]. The SS prediction problem has been widely treated as a classification problem.
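The 8-to-3 state reduction described above can be sketched directly in Python. The dictionary below encodes only the mapping stated in this paragraph; the function name is illustrative:

```python
# Reduce DSSP's eight secondary-structure codes to the three-state
# alphabet used throughout the paper: H (helix), E (sheet), C (coil).
DSSP_TO_THREE = {
    "h": "H",  # alpha-helix
    "g": "H",  # 3-10 helix
    "e": "E",  # beta-strand
    "b": "E",  # beta-bridge
    "t": "C",  # beta-turn
    "s": "C",  # bend
    "i": "C",  # pi-helix
    "c": "C",  # others
}

def reduce_dssp(states: str) -> str:
    """Map a string of DSSP codes to the three-state H/E/C alphabet."""
    return "".join(DSSP_TO_THREE[s] for s in states.lower())

print(reduce_dssp("hhgtteeb"))  # -> HHHCCEEE
```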
This involves assigning one of the three states to each amino acid residue in the protein sequence. Some of the most successful recent methods based on neural networks, such as PHD [6] and PSIPRED [7], map a local window of residues in the sequence to the structural state of the central residue in the window. PREDATOR, developed by Frishman and Argos [8], was based on another interesting approach that utilizes the underlying physical principles of secondary structure formation; it reached an accuracy of 68% without using evolutionary information. A number of recent studies of single-sequence prediction methods that do not use homolog information [9–12] have achieved accuracy levels in the range of 68–71%. The SS prediction problem was viewed from the alternative perspective of generative models through Bayesian segmentation by Schmidler et al. [13]. They developed a statistical generative model for sequence

Neurocomputing 72 (2009) 3943–3950. doi:10.1016/j.neucom.2009.04.017
* Corresponding author. Tel.: +61 3 9902 7148; fax: +61 3 9902 6842. E-mail address: Madhu.Chetty@infotech.monash.edu.au (M. Chetty).
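The window-based input representation used by classifiers such as PHD and PSIPRED can be illustrated with a minimal sketch. The window size, padding symbol and one-hot encoding below are illustrative assumptions, not the exact scheme of those methods:

```python
# Sketch of the sliding-window input used by window-based classifiers:
# each residue is predicted from a fixed-size window centred on it.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue: str) -> list:
    """21-dimensional encoding: 20 amino acids plus one padding slot."""
    vec = [0] * 21
    idx = AMINO_ACIDS.find(residue)
    vec[idx if idx >= 0 else 20] = 1
    return vec

def windows(sequence: str, size: int = 13) -> list:
    """One flattened input vector per residue, with end-of-chain padding."""
    half = size // 2
    padded = "-" * half + sequence + "-" * half
    return [
        sum((one_hot(r) for r in padded[i:i + size]), [])
        for i in range(len(sequence))
    ]

feats = windows("MKTAYIAKQR")
print(len(feats), len(feats[0]))  # 10 windows, each 13 * 21 = 273 inputs
```

A neural network trained on such vectors sees only the local context, which is precisely why the window approach captures local interactions well but misses the long-range contacts that the Bayesian segmentation stage is meant to model.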