Prediction of the Bonding States of Cysteines Using the Support Vector Machines Based on Multiple Feature Vectors and Cysteine State Sequences

Yu-Ching Chen,1 Yeong-Shin Lin,2 Chih-Jen Lin,3 and Jenn-Kang Hwang1,2,*

1 Institute of Bioinformatics, National Chiao Tung University, HsinChu, Taiwan, ROC
2 Department of Biological Science and Technology, National Chiao Tung University, HsinChu, Taiwan, ROC
3 Department of Computer Science, National Taiwan University, Taipei, Taiwan, ROC

ABSTRACT

The support vector machine (SVM) method is used to predict the bonding states of cysteines. Besides using local descriptors such as the local sequences, we include global information about the proteins, such as amino acid compositions and the patterns of the states of cysteines (bonded or nonbonded), referred to as cysteine state sequences. We found that SVM based on local sequences or global amino acid compositions yielded similar prediction accuracies for the data set comprising 4136 cysteine-containing segments extracted from 969 nonhomologous proteins. However, the SVM method based on multiple feature vectors (combining local sequences and global amino acid compositions) significantly improves the prediction accuracy, from 80% to 86%. If coupled with cysteine state sequences, SVM based on multiple feature vectors yields 90% overall prediction accuracy and a 0.77 Matthews correlation coefficient, around 10% and 22% higher than the corresponding values obtained by SVM based on local sequence information. Proteins 2004;55:1036–1042. © 2004 Wiley-Liss, Inc.

Key words: support vector machines; disulfide bonds; cysteine state sequences; multiple feature vectors

INTRODUCTION

The oxidation states of cysteines play a key role in both protein structure and function [1–8]. Cysteines form disulfide bridges to stabilize folded states by increasing favorable enthalpy interactions in the folded states and by lowering the entropy of the unfolded states.
[9] The capability to accurately predict the disulfide bonding states in proteins will be useful in the study of protein stability [10] or functions [11], and in the prediction of three-dimensional (3D) protein structures [12].

A number of computational approaches [13–17] were developed to predict the bonding states of cysteines. Muskal et al. [13] obtained 81% prediction accuracy by using neural networks based on sliding windows that include the flanking amino acids of the centered cysteines. Fiser et al. [14], exploiting the difference between the sequential environments of free cysteine and bonded cystine, developed a statistical approach that yielded a much lower prediction accuracy of 71%, though on a data set four times larger. Fariselli et al. [18] included evolutionary information in the form of multiple sequence alignments and used a jury of neural networks to obtain 81% prediction accuracy. Later, Fiser and Simon [15], observing that the cysteines of a protein tend to occur in the same oxidation state, developed a method based on multiple sequence alignments and achieved 82% prediction accuracy. Martelli et al. [16,19] used the hidden neural networks (HNNs) developed by Krogh and Riis [20] and obtained an overall prediction accuracy of 88%. Mucchielli-Giorgi et al. [17] found that the amino acid content of the whole protein appears to be more informative about the disulfide bonding state than the local sequence window is. Using a combination of logistic functions learned with subsets of proteins homogeneous in terms of their amino acid content, they were able to obtain a prediction accuracy close to 84% for 869 chains.

The support vector machine (SVM) method [21] has recently become popular in computational biology [22–26]. We have previously applied the SVM based on multiple feature vectors successfully to protein fold assignment [26] and subcellular localization prediction.
[27] In this work, we developed an approach to predict the bonding states of cysteines using SVM based on multiple feature vectors and the cysteine state sequences.

METHODS

The Support Vector Machine

Let $x_i$ be a local sequence window centered on the cysteine of interest, or another sequence coding vector (see the next section), and let $y_i$ denote the state of the cysteine, either bonded ($y_i = 1$) or nonbonded ($y_i = -1$). The SVM technique tries to find the separating hyperplane $w^T x_i + b = 0$ with the largest distance between the two classes, measured along a line perpendicular to this hyperplane. However, the data to be classified may not be linearly separable. To overcome this difficulty, the SVM nonlinearly transforms the original input space into a higher dimensional feature space by $\phi(x) = [\phi_1(x), \phi_2(x), \ldots]$ and tries to minimize $\frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$ with respect to w, b

Grant sponsor: National Science Council in Taiwan, Republic of China (to J.-K. Hwang).
*Correspondence to: Jenn-Kang Hwang, Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 30050, Taiwan, ROC. E-mail: jkhwang@cc.nctu.edu.tw
Received 22 September 2003; Accepted 4 December 2003
Published online 16 April 2004 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.20079
PROTEINS: Structure, Function, and Bioinformatics 55:1036–1042 (2004) © 2004 WILEY-LISS, INC.