Prediction of the Bonding States of Cysteines Using the
Support Vector Machines Based on Multiple Feature
Vectors and Cysteine State Sequences
Yu-Ching Chen,1 Yeong-Shin Lin,2 Chih-Jen Lin,3 and Jenn-Kang Hwang1,2*
1 Institute of Bioinformatics, National Chiao Tung University, HsinChu, Taiwan, ROC
2 Department of Biological Science and Technology, National Chiao Tung University, HsinChu, Taiwan, ROC
3 Department of Computer Science, National Taiwan University, Taipei, Taiwan, ROC
ABSTRACT The support vector machine (SVM) method is used to predict the bonding states of cysteines. Besides using local descriptors such as the local sequences, we include global information, such as amino acid compositions and the patterns of the states of cysteines (bonded or nonbonded), or cysteine state sequences, of the proteins. We found that SVM based on local sequences or global amino acid compositions yielded similar prediction accuracies for the data set comprising 4136 cysteine-containing segments extracted from 969 nonhomologous proteins. However, the SVM method based on multiple feature vectors (combining local sequences and global amino acid compositions) significantly improves the prediction accuracy, from 80% to 86%. If coupled with cysteine state sequences, SVM based on multiple feature vectors yields 90% in overall prediction accuracy and a 0.77 Matthews correlation coefficient, around 10% and 22% higher than the corresponding values obtained by SVM based on local sequence information.
Proteins 2004;55:1036–1042. © 2004 Wiley-Liss, Inc.
Key words: support vector machines; disulfide bonds; cysteine state sequences; multiple feature vectors
INTRODUCTION
The oxidation states of cysteines play a key role in both protein structure and function.[1–8] Cysteines form disulfide bridges to stabilize folded states by increasing favorable enthalpy interactions in the folded states and by lowering the entropy of the unfolded states.[9] The capability to accurately predict the disulfide bonding states in proteins will be useful in the study of protein stability[10] or functions,[11] and in the prediction of three-dimensional (3D) protein structures.[12] A number of computational approaches[13–17] were developed to predict the bonding states of cysteines. Muskal et al.[13] obtained 81% prediction accuracy by using neural networks based on the sliding windows that include the flanking amino acids of the centered cysteines. Fiser et al.,[14] exploiting the difference between the sequential environments of free cysteine and bonded cystine, developed a statistical approach that yielded a much lower 71% prediction accuracy, though using a data set four times bigger. Fariselli et al.[18] included evolutionary information in the form of multiple sequence alignment and used a jury of neural networks to obtain 81% prediction accuracy. Later, Fiser and Simon,[15] observing that cysteines of a protein tend to occur in the same oxidation states, developed a method based on multiple sequence alignments and achieved 82% prediction accuracy. Martelli et al.[16,19] used the hidden neural networks (HNNs) developed by Krogh and Riis,[20] and obtained an overall prediction accuracy of 88%. Mucchielli-Giorgi et al.[17] found that the amino acid content of the whole protein appears to be more informative about the disulfide bonding state than the local sequence window does. Using a combination of logistic functions learned with subsets of proteins homogeneous in terms of their amino acid content, they were able to obtain prediction accuracy close to 84% for 869 chains.

The support vector machine (SVM) method[21] has recently become popular in computational biology.[22–26] We have previously successfully applied the SVM based on multiple feature vectors to protein fold assignment[26] and subcellular localization prediction.[27] In this work, we developed an approach to predict the bonding states of cysteines using SVM based on multiple feature vectors and the cysteine state sequences.
METHODS
The Support Vector Machine
Let $x_i$ be a local sequence window centered on the cysteine of interest or another sequence coding vector (see the next section), and let $y_i$ denote the state of the cysteine, either bonded ($y_i = 1$) or nonbonded ($y_i = -1$). The SVM technique tries to find the separating hyperplane $w^T x_i + b = 0$ with the largest distance between the two classes, measured along a line perpendicular to this hyperplane.
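
The notation above can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the encoding or training procedure used in this work: the window half-width, the one-hot residue coding, the padding character, and all function names are hypothetical, and the margin formula assumes the usual scaling in which the closest points of each class satisfy $w^T x + b = \pm 1$.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def cysteine_windows(sequence, half_width=3, pad="-"):
    """Return (position, window) for each cysteine in the sequence.

    half_width and pad are illustrative choices (assumptions), used so
    that cysteines near the termini still get a full-length window.
    """
    padded = pad * half_width + sequence + pad * half_width
    return [(i, padded[i:i + 2 * half_width + 1])
            for i, residue in enumerate(sequence) if residue == "C"]


def one_hot(window):
    """Encode a window as a vector x_i: 20 indicator values per position.

    Padding characters become all-zero blocks. This coding is an
    assumption made for the sketch.
    """
    vec = []
    for residue in window:
        vec.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vec


def decision_value(w, x, b):
    """Compute w^T x + b for a given hyperplane (w, b)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_state(w, x, b):
    """Return y = 1 (bonded) or y = -1 (nonbonded) from the sign of w^T x + b."""
    return 1 if decision_value(w, x, b) >= 0 else -1


def margin_width(w):
    """Distance 2 / ||w|| between the planes w^T x + b = +1 and -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))
```

Because the margin is $2 / \lVert w \rVert$ under this scaling, the largest-margin hyperplane is the one with the smallest $w^T w$, which is why SVM training is posed as a minimization over $w$ and $b$.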
However, the data to be classified may not be linearly separable. To overcome this difficulty, the SVM nonlinearly transforms the original input space into a higher dimensional feature space by $\phi(x) = [\phi_1(x), \phi_2(x), \ldots]$ and tries to minimize $\frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$ with respect to $w$, $b$
Grant sponsor: National Science Council in Taiwan, Republic of
China (to J.-K. Hwang).
*Correspondence to: Jenn-Kang Hwang, Department of Biological
Science and Technology, National Chiao Tung University, Hsinchu
30050, Taiwan, ROC. E-mail: jkhwang@cc.nctu.edu.tw
Received 22 September 2003; Accepted 4 December 2003
Published online 16 April 2004 in Wiley InterScience
(www.interscience.wiley.com). DOI: 10.1002/prot.20079
PROTEINS: Structure, Function, and Bioinformatics 55:1036 –1042 (2004)