Using Pseudo Amino Acid Composition to Predict Protein Structural Class: Approached by Incorporating 400 Dipeptide Components

HAO LIN, QIAN-ZHONG LI

Laboratory of Theoretical Biophysics, Department of Physics, College of Sciences and Technology, Inner Mongolia University, Hohhot 010021, People's Republic of China

Received 25 August 2006; Revised 22 September 2006; Accepted 27 September 2006
DOI 10.1002/jcc.20554
Published online 1 March 2007 in Wiley InterScience (www.interscience.wiley.com).

Abstract: Protein structures can be classified mainly into four classes, all-α, all-β, α/β, and α+β proteins, according to their chain fold topologies. A new prediction algorithm, in which the increment of diversity is combined with quadratic discriminant analysis, is presented for predicting protein structural class. On the basis of the concept of the pseudo amino acid composition (Chou, Proteins: Struct Funct Genet 2001, 43, 246; Erratum: Proteins: Struct Funct Genet 2001, 44, 60), the 400 dipeptide components and the 20 amino acid components are each selected as parameters of the diversity source. A total of 204 nonhomologous proteins compiled by Chou (Chou, Biochem Biophys Res Commun 1999, 264, 216) are used for training and testing the predictive model. The pseudo amino acid approach proposed in this paper remarkably improves the success rates, and hence the current method may play a complementary role to other existing methods for predicting protein structural class.

© 2007 Wiley Periodicals, Inc. J Comput Chem 28: 1463–1466, 2007

Key words: protein structural class; increment of diversity; quadratic discriminant; dipeptide correlation

Introduction

Knowledge of protein three-dimensional structures plays an important role in understanding their function.
Most globular protein domains of known structure are generally categorized into four main structural classes, all-α, all-β, α/β, and α+β proteins, according to their chain fold topologies [1]. It has been suggested that the structural class of a protein correlates strongly with its amino acid composition [2,3], and considerable effort has been made to predict protein structural class from the amino acid composition or the functional domain composition [4–27]. Recently, the so-called pseudo amino acid composition (PseAA) was introduced as a set of predictive parameters, in combination with the support vector machine (SVM) [28] and the augmented covariant discriminant [29], to improve predictive accuracy. For the 204 proteins compiled by Chou [14], the overall sensitivities of the two algorithms are 85.3% and 89.7%, respectively.

In this article, a new algorithm, the increment of diversity (ID) combined with quadratic discriminant analysis (IDQD), is presented to predict protein structural class. The ID, which was first introduced and employed in biogeography, is a kind of information description on a state space and a measure of the overall uncertainty and total information of a system [30]. Recently, the ID algorithm and the IDQD model have been applied, respectively, to the recognition of protein structural class [31] and to exon–intron splice site prediction [32]. Here, we generalize the IDQD model from the two-class prediction problem to the multiclass prediction problem. On the basis of the absolute frequencies of the 400 dipeptides and the 20 amino acids, eight ID values are calculated and selected as input parameters of the quadratic discriminant (QD). The structural class of an arbitrary protein may be predicted by the minimum QD value. The results of the jackknife cross-validation test show a significant improvement over previously reported results.

Materials and Theoretical Algorithm

Database

The 204 proteins studied here were compiled by Chou [14]. These proteins, derived from SCOP, fall into four structural classes: 52 all-α, 61 all-β, 45 α/β, and 46 α+β. The average sequence similarity scores within each protein class are all lower than 30%. Therefore, the proteins in this database are not similar to one another.

Contract/grant sponsor: National Natural Science Foundation of China; contract/grant number: 30560039
Correspondence to: Q.-Z. Li; e-mail: qzli@imu.edu.cn
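
As an illustration of the input parameters mentioned in the Introduction, the following minimal Python sketch counts the absolute frequencies of the 20 amino acids and the 400 dipeptides in a sequence and computes an increment of diversity between two count vectors. This section does not state the ID formula, so the sketch assumes the form commonly used in the ID literature, D(X) = N log N − Σ n_i log n_i and ID(X, Y) = D(X + Y) − D(X) − D(Y). The function names, the alphabet ordering, and the handling of nonstandard residues are hypothetical choices made only for this example.

```python
import math
from itertools import product

# Assumption: the 20 standard residues in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 items
DIPEPTIDE_INDEX = {d: i for i, d in enumerate(DIPEPTIDES)}


def amino_acid_counts(seq):
    """Absolute frequency of each of the 20 amino acids in seq."""
    return [seq.count(a) for a in AMINO_ACIDS]


def dipeptide_counts(seq):
    """Absolute frequency of each of the 400 overlapping dipeptides in seq.

    Dipeptides containing a nonstandard residue are skipped (an assumption).
    """
    counts = [0] * len(DIPEPTIDES)
    for i in range(len(seq) - 1):
        j = DIPEPTIDE_INDEX.get(seq[i:i + 2])
        if j is not None:
            counts[j] += 1
    return counts


def diversity(counts):
    """Assumed diversity measure D = N log N - sum(n_i log n_i), with 0 log 0 = 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(c * math.log(c) for c in counts if c > 0)


def increment_of_diversity(x, y):
    """Assumed ID(X, Y) = D(X + Y) - D(X) - D(Y) for two count vectors."""
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)
```

Under this assumed definition, the ID of a count vector with itself is zero and grows as the two vectors become less alike. One reading of the "eight ID values" is that a query protein is compared against each of the four class-level count vectors for both parameter sets (amino acids and dipeptides); this section does not spell that out, so treat it as an interpretation rather than the stated procedure.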