Use of Variable Selection in Modeling the Secondary Structural Content of Proteins
from Their Composition of Amino Acid Residues
Teuta Piližota,† Bono Lučić,* and Nenad Trinajstić
The Rugjer Bošković Institute, P.O. Box 180, HR-10002 Zagreb, Croatia
Received February 27, 2003
Predicting the secondary structure content of proteins from their amino acid composition
can help bridge the gap for proteins that have a known primary sequence but an unknown
secondary structure. Almost all recently published models for understanding the relationship between
composition (frequency of occurrence) of amino acid residues and secondary structure content of proteins
involved composition of all 20 amino acid residues. However, it is well-known that many amino acid residues
are mutually similar according to their physicochemical properties (hydrophobicity, hydrophilicity, charge,
size, etc.). Because of that, we were motivated to investigate the possibility of reduction of the total number
of terms (frequencies of amino acid residues) in the models for describing the relation between the composition
of amino acid residues and the percentage of residues belonging to α, β, and coil secondary structure. For
this purpose, the CROMRsel algorithm (J. Chem. Inf. Comput. Sci. 1999, 39, 121-132) for selection of a
small subset of the most important variables/descriptors into the multiregression (MR) models, i.e., frequency
of occurrence of amino acid residues in proteins, was used. Analysis was performed on a data set containing
475 proteins, taken from Proteins 1996, 25, 157-168. The complete data set was partitioned into a 317-protein
training set and a 158-protein test set. The best possible linear models containing I = 1, ..., 20 frequencies
were selected among all 20 frequencies of occurrence of amino acid residues on the 317-protein training
set, and were used for performing prediction of the corresponding percentage of secondary structure content
on the 158-protein test set. For the 317-protein data set, the best selected concise models for the α, β, and
coil secondary structure contain only 9, 5, and 8 frequencies, respectively. The selected concise models have
the same or better fitted, cross-validated, and predictive statistical parameters than the models containing
all 20 frequencies. Additionally, for each I (I = 1, ..., 20), the 30 best possible random models were selected.
In each case, the best possible real models are much better than each of the best possible random models,
showing clearly that there is no risk of a chance correlation (which one might expect given the application
of an exhaustive search for the best model having I frequencies among all 20!/[I!(20-I)!] possible models).
Finally, the best selected models on the complete 475-protein data set for the α, β, and coil secondary
structure contain only 7, 4, and 7 frequencies of amino acid residues, respectively. These models are much
simpler and have better fitted and cross-validated errors than the corresponding models from the literature,
which were obtained without using a procedure for selection of the most important frequencies of amino acid
residues in proteins.
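The selection procedure summarized above (an exhaustive search for the best I-variable linear model, followed by a comparison against models built from random variables) can be illustrated with a minimal sketch. This is not the CROMRsel implementation: the function names `fit_errors` and `best_subset`, the synthetic data, and the way the random descriptors are generated are all assumptions made for illustration only, and ordinary least squares with a leave-one-out error stands in for the fitted and cross-validated statistics mentioned in the text.

```python
from itertools import combinations
from math import comb

import numpy as np


def fit_errors(X, y):
    """Return (fitted RMS error, leave-one-out RMS error) of an OLS model with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    # Leave-one-out residuals follow from the hat-matrix diagonal: e_i / (1 - h_ii).
    leverage = np.diag(A @ np.linalg.pinv(A))
    loo_resid = resid / (1.0 - leverage)
    return float(np.sqrt(np.mean(resid ** 2))), float(np.sqrt(np.mean(loo_resid ** 2)))


def best_subset(X, y, size):
    """Try every subset of `size` columns and return (columns, fitted error) of the best one."""
    best_cols, best_err = None, np.inf
    for cols in combinations(range(X.shape[1]), size):
        s_fit, _ = fit_errors(X[:, list(cols)], y)
        if s_fit < best_err:
            best_cols, best_err = cols, s_fit
    return best_cols, best_err


# Number of candidate models with I = 9 frequencies among 20: 20! / (9! * 11!)
n_models = comb(20, 9)

# Synthetic stand-in data (NOT the 475-protein set): 60 "proteins", 8 descriptors,
# with the response depending only on descriptors 1 and 4.
rng = np.random.default_rng(0)
X = rng.random((60, 8))
y = 2.0 * X[:, 1] - 3.0 * X[:, 4] + 0.05 * rng.normal(size=60)

real_cols, real_err = best_subset(X, y, 2)

# Chance-correlation check (assumed form): repeat the same exhaustive search
# on purely random descriptors and compare the best error obtained.
X_random = rng.random((60, 8))
_, random_err = best_subset(X_random, y, 2)
```

In this toy setting the search recovers the two informative descriptors, and the best model built from random descriptors fits far worse, which is the kind of contrast the random-model comparison is meant to reveal. A full search over 20 frequencies is larger (for example, `comb(20, 9)` candidate models for I = 9) but follows the same pattern.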
INTRODUCTION
Measurement or prediction of the secondary structure
content (percentages of α-helix (α), β-strand (β), and coil)
of a soluble protein can be considered as the first step in
getting information on its structure. The protein secondary
structure content can be experimentally determined by
circular dichroism (CD) spectroscopy in the UV absorption
range¹ and IR Raman spectroscopy.² In some cases, the
accuracy of these experimental methods is not satisfactory.
In addition, there are no general methods suitable for every
protein.³
Almost all published theoretical approaches were
developed by the multiregression technique and use protein
primary sequence information solely, or together with
physical and chemical properties of amino acid residues (refs
3-6 and references therein). To improve models for
predicting the protein secondary structure content, additional
descriptors (derived from the protein sequence information and
properties of amino acid residues, or squares and cross-products
of the initial frequencies of amino acid residues) were
added to the frequencies of occurrence of the 20 amino acid
residues, which have been used in every model.
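As a concrete illustration of the descriptors discussed above, the sketch below computes the frequency of occurrence of the 20 amino acid residues for a sequence and then appends squares and pairwise cross-products of those frequencies. The function names `composition` and `extended_descriptors`, the descriptor labels, and the example sequence are hypothetical and are not taken from the cited models.

```python
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues


def composition(sequence):
    """Return the fraction of each of the 20 residues in a one-letter sequence."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acid residues")
    n = len(residues)
    return {aa: residues.count(aa) / n for aa in AMINO_ACIDS}


def extended_descriptors(freq):
    """Add squares and pairwise cross-products to the 20 initial frequencies."""
    desc = dict(freq)
    for aa in AMINO_ACIDS:
        desc[aa + "^2"] = freq[aa] ** 2
    for a, b in combinations(AMINO_ACIDS, 2):
        desc[a + "*" + b] = freq[a] * freq[b]
    return desc


# Hypothetical short sequence, used only to exercise the functions.
freq = composition("MKVLAAGIVGLK")
desc = extended_descriptors(freq)
```

With 20 frequencies, 20 squares, and 190 cross-products, the extended set already holds 230 descriptors, which shows how quickly the number of optimized parameters can grow when such terms are added without selection.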
However, all of these models contain many optimized
parameters corresponding to a large number of included
descriptors and, consequently, they are of limited accuracy
in predicting the secondary structure content of a new set of
proteins. In addition, it is not easy to interpret such models
because they are complex and include several strongly
intercorrelated descriptors related to very similar amino acid
residues (which have similar physical and chemical properties).
From the similarity analysis of protein sequence
databases, we know that protein structures are redundant (i.e.,
a large portion of protein primary sequence may be irrelevant
for protein function), and that many point mutations do not
change protein structure and function. Redundancy of protein
* Corresponding author phone: ++385-1-4680095; fax: ++385-1-4680-245; e-mail: lucic@irb.hr.
† Present address: The Clarendon Laboratory, Department of Physics,
University of Oxford, South Parks Road, Oxford OX1 3PU.
113 J. Chem. Inf. Comput. Sci. 2004, 44, 113-121
10.1021/ci034037p CCC: $27.50 © 2004 American Chemical Society
Published on Web 11/12/2003