J. Biomedical Science and Engineering (JBiSE), 2009, 2, 136-143
Published Online June 2009 in SciRes. http://www.scirp.org/journal/jbise

Prediction of protein folding rates from primary sequence by fusing multiple sequential features

Hong-Bin Shen 1,3,*, Jiang-Ning Song 2, Kuo-Chen Chou 1,3

1 Institute of Image Processing & Pattern Recognition, Shanghai Jiaotong University, 800 Dongchuan Road, Shanghai, 200240, China;
2 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan;
3 Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, California 92130, USA.
*Corresponding author: hbshen@sjtu.edu.cn

Received 20 May 2009; revised 23 May 2009; accepted 1 June 2009.

ABSTRACT

We have developed a web-server for predicting the folding rate of a protein based on its amino acid sequence information alone. The web-server is called Pred-PFR (Predicting Protein Folding Rate). Pred-PFR fuses multiple individual predictors, each of which is established based on one special feature derived from the protein sequence. The ensemble predictor thus formed is superior to the individual ones: it achieves a higher correlation coefficient and a lower root mean square deviation between the predicted and observed results when examined by jackknife cross-validation on a recently constructed benchmark dataset. As a user-friendly web-server, Pred-PFR is freely accessible to the public at www.csbio.sjtu.edu.cn/bioinf/FoldingRate/.

Keywords: Protein Folding Rate; Ensemble Predictor; Fusion Approach; Web-Server; Pred-PFR

1. INTRODUCTION

Knowledge of protein three-dimensional (3D) structures plays an indispensable role in molecular biology, cell biology, biomedicine, and drug design [1]. However, each protein begins as a polypeptide, translated from a sequence of mRNA as a linear chain of amino acids.
A protein can function properly only if it is folded into a correct shape or conformation [2]. Failure to fold into the intended 3D structure usually produces inactive proteins with different properties. Although many efforts have been made to understand the mechanism of protein folding (see, e.g., [3,4,5,6]), it remains one of the most challenging problems in molecular biology. In addition to understanding how a protein chain folds, it is also important to determine the folding rates of proteins from their primary sequences. Protein chains can fold into their functional 3D structures at quite different rates, with folding times varying from several microseconds to as long as an hour [7,8].

Experimentally determining the three-dimensional structure of a protein is often very difficult and expensive, whereas the sequence of the protein is readily available. Therefore, for quite a long time, scientists have tried to use the “least free energy principle” [2,9] to predict the 3D structures of proteins. Unfortunately, owing to the notorious local energy minimum problem, so far this principle has been applied successfully only to a very limited set of structural characteristics, such as the handedness tendency and packing arrangement in proteins (see, e.g., [10,11,12]). In the past two decades, various statistical methods have been developed for predicting the structural classes of proteins and their folding patterns according to the sequence information alone (see, e.g., [13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28] and a review [29]).

Encouraged by the results obtained via these statistical approaches, researchers have developed various methods for predicting the folding rates of proteins, because the information thus acquired would be very useful for understanding the protein folding mechanism and the sequence-structure-function relationship [8,30].
In this regard, the approaches can be generally categorized into two groups: (1) those in which the prediction of protein folding rates is based on protein structure information; and (2) those in which the prediction is based on primary sequence information.

For the first group, the features of proteins are extracted from their 3D structural information, and hence the predictions are feasible only after the structures have been determined. Most of the methods in this group try to derive the statistical significance of the correlation between the protein folding rate and the corresponding structural topological parameters, such as contact order (CO) [31], absolute contact order (Abs_CO) [32], total contact distance (TCD) [33], long-range order (LRO) [34], the fraction of local contact (FLC) [34], the chain