Southeast Europe Journal of Soft Computing
Available online: www.scjournal.com.ba
VOL.2 NO.2 September 2013 - ISSN 2233 – 1859

Regression Analysis to Predict the Secondary Structure of Proteins

Betul Akcesme and Faruk B. Akcesme
International University of Sarajevo, Faculty of Engineering and Natural Sciences, Hrasnicka Cesta 15, Ilidža 71210 Sarajevo, Bosnia and Herzegovina
betul.cicek@yahoo.com; fakcesme@ius.edu.ba

Article Info
Article history: Article received Sep. 2014; Received in revised form Oct. 2014
Keywords: Secondary structure; Conformation of proteins; Statistical methods

Abstract
A method is presented for protein secondary structure prediction based on the use of multidimensional regression. Two hundred proteins are chosen from the RCSB Protein Database. Their secondary structures, obtained through X-ray crystallography analyses, are downloaded from the same source. The primary and secondary structures of the proteins are concatenated separately to create a sequence of 169 026 residues. The first 150 000 amino acid residues and their corresponding secondary structures are used to create a regression model. The remaining 19 026 residues are used for testing. Since we expect three outputs, α-helices "S", β-sheets "H", and coiled coils "C", our regression model consists of 3 × 20 × 23 parameters. These parameters are tuned, and a correct classification rate of 62.50% is achieved on the test data. Furthermore, the performance of the regression model is compared with online secondary structure estimation algorithms on 14 unused proteins and is found to be comparable with the online estimation tools.

1. INTRODUCTION
Large-scale sequencing projects have produced a large number of protein sequences. In 1993 the number was 26,000 sequences (Bairoch & Boeckmann, 1963; Ewbank & Creighton, 1992), but before the end of the century it easily passed the 500,000 mark. Today, at the end of 2014, the number has reached 546,790.
Compared with the number of known protein sequences, the number of proteins whose structure is known is still very limited: in 1993 it was about 1000 (Bernstein et al., 1977). Today it has reached 105,025, and increased efforts are focused on narrowing the widening gap. The most reliable prediction of the structure of new proteins is done by detecting significant similarities to proteins of known structure (Taylor & Orengo, 1989; Sander & Schneider, 1991; Vriend & Sander, 1991). However, as of 1993, only about one-seventh of new sequences had similarities to known structures (Bork et al., 1992).

Figure 1. Number of proteins whose structures are known
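
As a rough illustration of the data layout summarized in the abstract, the following Python sketch shows one way the 3 × 20 × 23 parameter count and the 150 000 / 19 026 split could arise. It assumes that 23 is the length of a sliding window of residues and that each residue is one-hot encoded over the 20 standard amino acids; this reading, and the names `encode_window` and `split`, are hypothetical and are not taken from the paper.

```python
# Sketch only: reading "3 x 20 x 23" as classes x amino acids x window length
# is an assumption made for illustration, not the authors' stated design.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
CLASSES = 3            # three secondary-structure outputs
WINDOW = 23            # assumed sliding-window length (hypothetical)
TOTAL = 169_026        # concatenated residues (from the abstract)
TRAIN_SIZE = 150_000   # residues used to build the model (from the abstract)


def encode_window(sequence, center):
    """One-hot encode the WINDOW residues centred on `center`.

    Positions that fall outside the sequence, and residues that are not
    among the 20 standard amino acids, are left as all-zero blocks.
    """
    half = WINDOW // 2
    features = [0] * (WINDOW * len(AMINO_ACIDS))
    for offset in range(-half, half + 1):
        pos = center + offset
        if 0 <= pos < len(sequence):
            idx = AMINO_ACIDS.find(sequence[pos])
            if idx >= 0:
                features[(offset + half) * len(AMINO_ACIDS) + idx] = 1
    return features


def split(residues, labels):
    """Use the first TRAIN_SIZE positions for fitting, the rest for testing."""
    train = (residues[:TRAIN_SIZE], labels[:TRAIN_SIZE])
    test = (residues[TRAIN_SIZE:], labels[TRAIN_SIZE:])
    return train, test


# One weight per (output class, window position, amino acid).
n_parameters = CLASSES * len(AMINO_ACIDS) * WINDOW
```

Under this assumed encoding, each residue position yields a 460-element feature vector, and a linear model with one weight vector per output class has 3 × 460 = 1380 parameters, which matches the count quoted in the abstract.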