Improved protein structural class prediction based on chaos game representation

Mohammad Olyaee
Department of Computer Engineering, Islamic Azad University, Mashhad Branch, Iran
mh.olyaee@gmail.com

Mehdi Yaghubi
Department of Computer Engineering, Islamic Azad University, Mashhad Branch, Iran
yaghobi@mshdiau.ac.ir

Abstract- Determining the structural class of a protein from sequence information is a challenging task. In this paper, we first apply chaos game representation (CGR) to protein sequences and extract two time series from each. We then use phase space reconstruction theory to compute the phase space of every time series and apply recurrence quantification analysis (RQA), which yields 16 characteristic parameters for each protein sequence. For classification, we propose an ensemble classification method. The 10-fold cross-validation test is used to evaluate our method and compare it with other existing methods. The overall accuracies for the two datasets 1189 and 25PDB are 66.7% and 68.2%, respectively, which exceeds the accuracy of the compared methods.

Keywords- protein structural class; chaos game representation; phase space; recurrence plot; classifier ensemble

I. INTRODUCTION

Levitt and Chothia [1] defined the concept of protein structural classes. According to this definition, proteins are classified into four groups: (1) the all-α class, which includes proteins with only a small amount of strands; (2) the all-β class, whose proteins are formed by strands with only a small amount of helices; (3) the α/β class, which includes both helices and strands, with the strands mostly parallel; and (4) the α+β class, which includes both helices and mostly antiparallel strands. The structural class is one of the important attributes of a protein: for example, it can increase the accuracy of secondary structure prediction [2] and can help reduce the conformational search space in tertiary structure prediction [3, 4]. It is therefore useful to know the structural class of a protein. There have been many efforts to solve this problem [5, 6].
Most of them use amino acid (AA) composition; however, much important information about sequence order is lost, which reduces the success rate of prediction. In view of this, various methods have been presented that incorporate the pair-coupled amino acid composition [3], polypeptide composition [7], pseudo-amino acid composition [8], and functional domain composition [9].

Recently, Yang et al. [10, 11] successfully presented several methods to predict the structural class of proteins based on chaos game representation (CGR). CGR of protein structures was first proposed by Fiser et al. [12]. Later, Basu et al. [13] and Yu et al. [14] proposed several other kinds of CGR of proteins. Yang et al. [10] transform protein sequences into nucleotide sequences based on the reverse encoding of amino acids [15] and use the CGR for DNA sequences based on [16]. Since direct analysis of the CGR is difficult, two time series are extracted from it. A powerful nonlinear method, recurrence quantification analysis (RQA), is then applied to analyze these time series. Eight parameters are obtained for each time series, and the resulting 16 (8 × 2) parameters are used to predict the structural class. Yang et al. achieved accuracies of 65.8% and 64.2% for the low-homology datasets 1189 (1092 domains) and 25PDB (1673 domains), respectively. However, before analyzing the time series with RQA, they set the embedding dimension (m) to 8 and the delay time (τ) to 2 for all time series.

In this paper, to improve this method, we use the Grassberger-Procaccia (GP) algorithm and the autocorrelation function to calculate the phase space of each time series individually, and we apply an ensemble of classification algorithms. The 10-fold cross-validation test, which is reliable and not time-consuming [6], is used to evaluate our method and compare it with other existing methods. Experimental results show that our method outperforms the others and may play a complementary role to the existing methods.

II. MODELS AND METHODS

A. Amino acid sequence to DNA sequence

There are several methods for transforming protein sequences into nucleotide sequences. We use the encoding method of [15], which is listed in Table I.

TABLE I. Reverse encoding for amino acids

A=GCT   G=GGT   M=ATG   S=TCA
C=TGC   H=CAC   N=AAC   T=ACT
D=GAC   I=ATT   P=CCA   V=GTG
E=GAG   K=AAG   Q=CAG   W=TGG
F=TTC   L=CTA   R=CGA   Y=TAC

B. Chaos game representation

The CGR of a nucleotide sequence is defined in the square [0, 1] × [0, 1], whose four vertices correspond to the four letters A, T, C and G. The first point is placed halfway between the center of the square and the vertex corresponding to the first letter of the sequence; the ith point is then placed halfway between the (i-1)th point and the vertex corresponding to the ith

2010 Fourth Asia International Conference on Mathematical/Analytical Modelling and Computer Simulation
978-0-7695-4062-7/10 $26.00 © 2010 IEEE
DOI 10.1109/AMS.2010.99