International Journal of Computational Intelligence and Applications
Vol. 6, No. 4 (2006) 551–567
© Imperial College Press

PROTEIN SECONDARY STRUCTURE PREDICTION USING SUPPORT VECTOR MACHINES AND A NEW FEATURE REPRESENTATION

JAYAVARDHANA GUBBI*, DANIEL T. H. LAI† and MARIMUTHU PALANISWAMI‡
Department of Electrical and Electronic Engineering
The University of Melbourne
Victoria 3010, Australia
*jrgl@ee.unimelb.edu.au
†dlai@ee.unimelb.edu.au
‡swami@ee.unimelb.edu.au

MICHAEL PARKER
St. Vincent's Institute of Medical Research
9 Princes Street, Fitzroy
Victoria 3065, Australia
mparker@svi.edu.au

Received 4 July 2006
Revised 10 January 2007
Accepted 24 January 2007

Knowledge of the secondary structure and solvent accessibility of a protein plays a vital role in the prediction of its fold and, eventually, its tertiary structure. This paper addresses the challenging problem of predicting protein secondary structure from sequence alone. Support vector machines (SVMs) are employed for the classification, and the SVM outputs are converted to posterior probabilities for multi-class classification. The effect of using Chou–Fasman parameters and physico-chemical parameters along with evolutionary information in the form of a position specific scoring matrix (PSSM) is analyzed. The proposed methods are tested on the RS126 and CB513 datasets. A new dataset (PSS504) is curated using a recent release of CATH. On the CB513 dataset, a sevenfold cross-validation accuracy of 77.9% was obtained using the proposed encoding method. A new method of calculating the reliability index, based on the number of votes and the SVM decision value, is also proposed. A blind test on the EVA dataset gives an average Q3 accuracy of 74.5%, which ranks among the top five protein structure prediction methods. Supplementary material, including the datasets, is available at http://www.ee.unimelb.edu.au/ISSNIP/bioinf/.
Keywords: Protein secondary structure prediction; support vector machines; position specific scoring matrix (PSSM); Chou–Fasman parameters; Kyte–Doolittle hydrophobicity; Grantham polarity; reliability index; novel encoding scheme.

1. Introduction

The results of the Human Genome Project have left a significant gap between the availability of protein sequences and their corresponding structures.¹ The