Structure-Related Statistical Singularities along Protein Sequences: A Correlation Study
Mauro Colafranceschi,† Alfredo Colosimo,† Joseph P. Zbilut,‡ Vladimir N. Uversky,§ and Alessandro Giuliani*,|
Department of Human Physiology and Pharmacology - University of Rome “La Sapienza”, P.le A. Moro,
5-00185 Rome, Italy, Department of Molecular Biophysics and Physiology, Rush Medical College,
Chicago, Illinois 60612, Department of Chemistry and Biochemistry, University of California,
Santa Cruz, California 95064, Institute for Biological Instrumentation of the Russian Academy of Sciences,
Pushchino, Moscow Region, 142290 Russia, and Environment and Health Department - Istituto Superiore di
Sanità, Viale Regina Elena, 299-00161 Rome, Italy
Received May 18, 2004
A data set composed of 1141 proteins representative of all eukaryotic protein sequences in the Swiss-Prot
Protein Knowledgebase was coded by seven physicochemical properties of amino acid residues. The resulting
numerical profiles were submitted to correlation analysis after the application of a linear (simple mean) and
a nonlinear (Recurrence Quantification Analysis, RQA) filter. The main RQA variables, Recurrence and
Determinism, were subsequently analyzed by Principal Component Analysis. The RQA descriptors showed
that (i) protein sequences embed specific information that is present neither in the codes nor in the
amino acid composition and (ii) the most sensitive code for detecting ordered recurrent (deterministic) patterns
of residues in protein sequences is the Miyazawa-Jernigan hydrophobicity scale. The most deterministic
proteins in terms of autocorrelation properties of primary structures were found (i) to be involved in protein-
protein and protein-DNA interactions and (ii) to display a significantly higher proportion of structural
disorder with respect to the average data set. A study of the scaling behavior of the average determinism
with the setting parameters of RQA (embedding dimension and radius) allows for the identification of patterns
of minimal length (six residues) as possible markers of zones specifically prone to inter- and intramolecular
interactions.
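As an illustration of the pipeline summarized above (coding a sequence by a physicochemical property, then extracting Recurrence and Determinism), the following is a minimal sketch and not the authors' implementation. It makes several assumptions: the Kyte-Doolittle hydropathy index stands in for the seven codes used in the study (the Miyazawa-Jernigan scale is not reproduced here), the embedding uses a unit delay, distances are Euclidean, and the default values of `dim`, `radius`, and `min_line` are purely illustrative. The function names `encode`, `embed`, and `rqa` are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy index, used here only as a stand-in code
# (assumption: the study's own seven codes are not reproduced).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, scale=KYTE_DOOLITTLE):
    """Turn a one-letter amino acid string into a numerical profile."""
    return [scale[residue] for residue in sequence.upper()]


def embed(series, dim):
    """Unit-delay embedding: overlapping windows of `dim` consecutive values."""
    return [tuple(series[i:i + dim]) for i in range(len(series) - dim + 1)]


def rqa(series, dim=3, radius=1.0, min_line=2):
    """Return (%REC, %DET) for a numerical series.

    %REC: percentage of distinct window pairs closer than `radius`.
    %DET: percentage of recurrent pairs lying on diagonal lines of at
          least `min_line` consecutive recurrent points.
    """
    vectors = embed(series, dim)
    n = len(vectors)
    recurrent = 0
    in_lines = 0
    # Scan each diagonal above the main one of the recurrence matrix.
    for k in range(1, n):
        run = 0
        for i in range(n - k):
            if math.dist(vectors[i], vectors[i + k]) <= radius:
                recurrent += 1
                run += 1
            else:
                if run >= min_line:
                    in_lines += run
                run = 0
        if run >= min_line:  # close a line that reaches the matrix edge
            in_lines += run
    pairs = n * (n - 1) // 2
    rec = 100.0 * recurrent / pairs if pairs else 0.0
    det = 100.0 * in_lines / recurrent if recurrent else 0.0
    return rec, det


# Usage: a strictly periodic toy sequence versus a non-repeating one.
periodic = rqa(encode("ACDACDACDACD"), dim=3, radius=0.0)
aperiodic = rqa(encode("ACDEFGHIKL"), dim=3, radius=0.0)
print("periodic  %REC, %DET:", periodic)
print("aperiodic %REC, %DET:", aperiodic)
```

In this sketch a larger `dim` demands longer matching patterns, while a larger `radius` tolerates physicochemically similar but non-identical residues, which is the sense in which the two setting parameters can be scanned.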
1. INTRODUCTION
Protein sequences are, with rare exceptions (e.g., fibrous
polymerizing proteins such as collagen or silk), quasi-random
strings of symbols with scant evidence of order or periodicity:
a reliable estimate of the entropy reduction due to the
autocorrelation of residues in an average protein sequence
is only about 1%.¹ Nevertheless, such quasi-random strings
are the basic recipes producing refined three-dimensional
structures, which sustain sophisticated dynamics along with
specific physiological roles. Thus, the observed
quasi-randomness may be a specious image for underlying meaning.
It is interesting to note that a similar situation occurs in
the case of human languages where it is almost impossible
to generate meaningful texts using just periodic repetitions
of symbols.² There is, however, a fundamental difference
between linguistic rules and the rules governing
sequence/structure/activity of proteins: in human languages the
linkage between the strings of characters (words) and their
semantic meaning is completely arbitrary and needs an external
intelligent and active receiver to be decoded. Amino acid
sequences, on the other hand, are translated into biologically
meaningful messages in the form of proteins by the
physicochemical environment (e.g., ionic strength, relative
hydrophobicity, temperature, pressure).³⁻⁵
A focus on the numerical series of physicochemical
properties of amino acid residues has provided interesting
results in the study of specific protein behavior.⁶,⁷ At the
same time, the quasi-random qualification of symbolically
coded protein sequences evokes the possibility of solving
the sequence/structure/activity puzzle by discovering subtle,
albeit crucial regularities in the juxtaposition of symbols. The
importance of such regularities in decoding signals can be
better appreciated if one recalls the development of
cryptography during the Second World War. The decipherment
of hidden information in encrypted messages was based upon
the notion that any human language, despite its apparent
randomness and arbitrariness, is endowed with regularities
of various kinds (e.g., the relative abundance of words of
given length, the juxtaposition of pairs of symbols, etc.) and
that no “masking code” can obscure the code-independent
features typical of the original language.⁸
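The idea of regularities that survive a masking code can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: a monoalphabetic substitution cipher (far simpler than wartime ciphers) plays the role of the masking code, and two regularities named above are checked, namely the lengths of words and the positions at which a symbol recurs. The function names `substitution_cipher`, `recurrence_pairs`, and `word_lengths` are hypothetical.

```python
import random
import string


def substitution_cipher(text, seed=0):
    """Mask lowercase letters with a random one-to-one letter substitution."""
    alphabet = string.ascii_lowercase
    shuffled = list(alphabet)
    random.Random(seed).shuffle(shuffled)
    table = str.maketrans(alphabet, "".join(shuffled))
    return text.translate(table)  # spaces and other symbols pass through


def recurrence_pairs(text):
    """Index pairs (i, j), i < j, at which the same symbol occurs."""
    return {
        (i, j)
        for i in range(len(text))
        for j in range(i + 1, len(text))
        if text[i] == text[j]
    }


def word_lengths(text):
    """Lengths of the space-separated words, in order."""
    return [len(word) for word in text.split()]


plain = "the cat sat on the mat"
masked = substitution_cipher(plain)

# The symbols change, but where symbols recur and how long the words are
# does not: these are code-independent features of the message.
print(masked)
print(recurrence_pairs(plain) == recurrence_pairs(masked))
print(word_lengths(plain) == word_lengths(masked))
```

Because the substitution is one-to-one, two positions carry the same symbol after masking exactly when they did before, so any statistic built only on symbol identity is unaffected by the choice of code.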
In the present study we adopt the following assumption:
the various physicochemical codes of amino acid residues are
considered as masking codes, and the distinction between
“code-dependent” and “code-independent” regularities is used
to highlight some statistical features of amino acid patterns.
The assumption derives from the fact that a well-defined set
of physicochemical rules is able to unambiguously transform
a given amino acid sequence into a 3D molecular
structure. Hence, any physicochemical code can be consid-
* Corresponding author phone: ++39 06 49902579; fax: ++39 06
49902355; e-mail: alessandro.giuliani@iss.it.
† University of Rome “La Sapienza”.
‡ Rush Medical College.
§ University of California and Institute for Biological Instrumentation of the Russian Academy of Sciences.
| Istituto Superiore di Sanità.
183 J. Chem. Inf. Model. 2005, 45, 183-189
10.1021/ci049838m CCC: $30.25 © 2005 American Chemical Society
Published on Web 11/24/2004