Improved protein structural class prediction based on chaos game representation
Mohammad Olyaee
Department of Computer Engineering,
Islamic Azad University,
Mashhad Branch, Iran
mh.olyaee@gmail.com
Mehdi Yaghubi
Department of Computer Engineering,
Islamic Azad University,
Mashhad Branch, Iran
yaghobi@mshdiau.ac.ir
Abstract—Determination of protein structural class from
sequence information is a challenging task. In this paper, we
first apply chaos game representation (CGR) to protein sequences
and extract two time series. Using phase space
reconstruction theory, we then calculate the phase space of each
time series and apply recurrence quantification analysis
(RQA), which yields 16 characteristic parameters for each
protein sequence. For classification, we propose an ensemble
classification method. The 10-fold cross-validation test is
used to evaluate our method and compare it with existing
methods. The overall accuracies on the two datasets, 1189 and
25PDB, are 66.7% and 68.2%, respectively, which is higher than
those of the compared methods.
Keywords- Protein structural class; chaos game
representation; phase space; recurrence plot; classifier ensemble
I. INTRODUCTION
Levitt and Chothia [1] defined the concept of protein
structural classes. According to this definition, proteins are
classified into four groups: (1) the all-α class, which includes
proteins formed mainly by helices with only a small amount of
strands; (2) the all-β class, which is formed by strands with
only a small amount of helices; (3) the α/β class, which
includes both helices and strands, where the strands are mostly
parallel; and (4) the α+β class, which includes both helices
and mostly antiparallel strands. The structural class is one of
the important attributes of a protein. For example, it can
increase the accuracy of secondary structure prediction [2]
or help to reduce the search scope of conformations in
tertiary structure prediction [3, 4], so it is useful to know the
structural class of a protein. Many efforts have been made to solve this
problem [5, 6]. Most of them use amino acid composition
(AA); however, this discards important information about
sequence order, which reduces the success rate of prediction.
In view of this, various methods have been presented,
including the pair-coupled amino acid composition [3],
polypeptide composition [7], pseudo-amino acid
composition [8], and functional domain composition [9]. Recently,
Yang et al. [10, 11] successfully presented several
methods to predict the structural class of proteins
based on chaos game representation (CGR). CGR of protein
structures was first proposed by Fisher et al. [12]. Later, Basu
et al. [13] and Yu et al. [14] proposed several other kinds of
CGR of proteins. Yang et al. [10] transform protein
sequences into nucleotide sequences based on the reverse
encoding of amino acids [15] and apply the CGR for DNA
sequences described in [16]. Since the CGR is difficult to
analyze directly, two time series are extracted from it. A
powerful nonlinear method, recurrence quantification
analysis (RQA), is then applied to analyze these time series. For
each time series eight parameters are obtained, and the resulting
16 (8 × 2) parameters are used to predict the structural class.
Yang et al. achieved accuracies of 65.8% and 64.2% on the
low-homology 1189 (1092 domains) and 25PDB (1673 domains)
datasets, respectively. However, before analyzing the time series
with RQA, they set the embedding dimension (m) to 8 and the
delay time (τ) to 2 for all time series. In this paper, to improve this method, we
use the GP algorithm and the autocorrelation function to
calculate the phase space of each time series individually and
apply an ensemble of classification algorithms. The 10-fold
cross-validation test, which is reliable and not time
consuming [6], is used to evaluate our method and compare it
with existing methods. Experimental results show
that our method achieves higher accuracy than the compared
methods and may play a complementary role to existing methods.
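As a rough illustration of the phase space reconstruction step mentioned above, the following sketch builds delay vectors from a time series for a given embedding dimension m and delay time τ. It is only a sketch under stated assumptions: the function name delay_embed and the toy sine series are hypothetical, NumPy is assumed to be available, and the values m = 8 and τ = 2 are the fixed settings attributed to Yang et al. in the text, not values selected per series.

```python
import numpy as np


def delay_embed(series, m, tau):
    """Return delay vectors X_i = (x_i, x_{i+tau}, ..., x_{i+(m-1)tau}).

    Hypothetical helper for illustration; not taken from the paper.
    """
    x = np.asarray(series, dtype=float)
    n_vectors = len(x) - (m - 1) * tau
    if n_vectors <= 0:
        raise ValueError("series too short for the chosen m and tau")
    # Row i starts at x[i]; column j is shifted by j * tau samples.
    idx = np.arange(n_vectors)[:, None] + tau * np.arange(m)[None, :]
    return x[idx]


# Toy series standing in for one of the two CGR-derived time series.
toy = np.sin(0.3 * np.arange(50))
vectors = delay_embed(toy, m=8, tau=2)
print(vectors.shape)  # (36, 8)
```

Each row of the returned array is one point of the reconstructed phase space. Choosing m and τ separately for each series would only change the two arguments passed to the function.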
II. MODELS AND METHODS
A. Amino Acid sequence to DNA sequence
There are several methods for transforming protein
sequences into nucleotide sequences. Here we use the
encoding method of [15], which is listed in Table I.
TABLE I. Reverse encoding for amino acids
A=GCT   G=GGT   M=ATG   S=TCA
C=TGC   H=CAC   N=AAC   T=ACT
D=GAC   I=ATT   P=CCA   V=GTG
E=GAG   K=AAG   Q=CAG   W=TGG
F=TTC   L=CTA   R=CGA   Y=TAC
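The encoding in Table I can be applied with a simple lookup, sketched below. The names REVERSE_ENCODING and protein_to_dna are hypothetical and chosen only for illustration. One assumption is made: E is mapped to GAG, a standard glutamate codon, so that D and E do not share a codon.

```python
# Codon table transcribed from Table I.
# Assumption: E is taken as GAG so that D and E do not share a codon.
REVERSE_ENCODING = {
    "A": "GCT", "C": "TGC", "D": "GAC", "E": "GAG", "F": "TTC",
    "G": "GGT", "H": "CAC", "I": "ATT", "K": "AAG", "L": "CTA",
    "M": "ATG", "N": "AAC", "P": "CCA", "Q": "CAG", "R": "CGA",
    "S": "TCA", "T": "ACT", "V": "GTG", "W": "TGG", "Y": "TAC",
}


def protein_to_dna(protein: str) -> str:
    """Map each amino acid letter to its codon and concatenate the codons."""
    try:
        return "".join(REVERSE_ENCODING[aa] for aa in protein.upper())
    except KeyError as err:
        # Letters outside the 20 standard residues are rejected explicitly.
        raise ValueError(f"unsupported residue: {err.args[0]!r}") from None


print(protein_to_dna("MKV"))  # ATGAAGGTG
```

A protein of length n therefore yields a nucleotide sequence of length 3n, which is the input to the chaos game representation.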
B. Chaos game representation
The CGR of a nucleotide sequence is defined in the square
[0, 1] × [0, 1], where the four vertices correspond to the four
letters A, T, C and G. The first point is placed halfway between the
center of the square and the vertex corresponding to the first
letter of the sequence; the ith point is then placed halfway
between the (i-1)th point and the vertex corresponding to the ith
2010 Fourth Asia International Conference on Mathematical/Analytical Modelling and Computer Simulation
978-0-7695-4062-7/10 $26.00 © 2010 IEEE
DOI 10.1109/AMS.2010.99