Structure-Related Statistical Singularities along Protein Sequences: A Correlation Study

Mauro Colafranceschi,† Alfredo Colosimo,† Joseph P. Zbilut,‡ Vladimir N. Uversky,§ and Alessandro Giuliani*,|

Department of Human Physiology and Pharmacology - University of Rome “La Sapienza”, P.le A. Moro, 5-00185 Rome, Italy, Department of Molecular Biophysics and Physiology, Rush Medical College, Chicago, Illinois 60612, Department of Chemistry and Biochemistry, University of California, Santa Cruz, California 95064, Institute for Biological Instrumentation of the Russian Academy of Sciences, Pushchino, Moscow Region, 142290 Russia, and Environment and Health Department - Istituto Superiore di Sanità, Viale Regina Elena, 299-00161 Rome, Italy

Received May 18, 2004

A data set composed of 1141 proteins representative of all eukaryotic protein sequences in the Swiss-Prot Protein Knowledgebase was coded by seven physicochemical properties of amino acid residues. The resulting numerical profiles were submitted to correlation analysis after the application of a linear (simple mean) and a nonlinear (Recurrence Quantification Analysis, RQA) filter. The main RQA variables, Recurrence and Determinism, were subsequently analyzed by Principal Component Analysis. The RQA descriptors showed that (i) protein sequences embed specific information that is present neither in the codes nor in the amino acid composition and (ii) the most sensitive code for detecting ordered recurrent (deterministic) patterns of residues in protein sequences is the Miyazawa-Jernigan hydrophobicity scale. The most deterministic proteins in terms of autocorrelation properties of primary structures were found (i) to be involved in protein-protein and protein-DNA interactions and (ii) to display a significantly higher proportion of structural disorder with respect to the average data set.
A study of the scaling behavior of the average determinism with the setting parameters of RQA (embedding dimension and radius) allows for the identification of patterns of minimal length (six residues) as possible markers of zones specifically prone to inter- and intramolecular interactions.

* Corresponding author phone: ++39 06 49902579; fax: ++39 06 49902355; e-mail: alessandro.giuliani@iss.it.
† University of Rome “La Sapienza”.
‡ Rush Medical College.
§ University of California and Institute for Biological Instrumentation of the Russian Academy of Sciences.
| Istituto Superiore di Sanità.

J. Chem. Inf. Model. 2005, 45, 183-189. 10.1021/ci049838m CCC: $30.25 © 2005 American Chemical Society. Published on Web 11/24/2004

1. INTRODUCTION

Protein sequences are, with rare exceptions (e.g., fibrous polymerizing proteins such as collagen or silk), quasi-random strings of symbols with scant evidence of order or periodicity: a reliable estimate of the entropy reduction due to the autocorrelation of residues in an average protein sequence is only about 1%.[1] Nevertheless, such quasi-random strings are the basic recipes producing refined three-dimensional structures, which sustain sophisticated dynamics along with specific physiological roles. Thus, the observed quasi-randomness may be a specious image for underlying meaning. It is interesting to note that a similar situation occurs in the case of human languages, where it is almost impossible to generate meaningful texts using just periodic repetitions of symbols.[2] There is, however, a fundamental difference between linguistic rules and the rules governing sequence/structure/activity of proteins: in human languages the linkage between the strings of characters (words) and their semantic meaning is completely arbitrary and needs an external intelligent and active receiver to be decoded. Amino acid sequences, on the other hand, are translated into biologically meaningful messages in the form of proteins by the physicochemical environment (e.g., ionic strength, relative hydrophobicity, temperature, pressure).[3-5] A focus on the numerical series of physicochemical properties of amino acid residues has provided interesting results in the study of specific protein behavior.[6,7]

At the same time, the quasi-random qualification of symbolically coded protein sequences evokes the possibility of solving the sequence/structure/activity puzzle by discovering subtle, albeit crucial, regularities in the juxtaposition of symbols. The importance of such regularities in decoding signals can be better appreciated if one recalls the development of cryptography during the Second World War. The decipherment of hidden information in encrypted messages was based upon the notion that any human language, despite its apparent randomness and arbitrariness, is endowed with regularities of various kinds (e.g., the relative abundance of words of given length, the juxtaposition of pairs of symbols, etc.) and that no “masking code” can obscure the code-independent features typical of the original language.[8]

In the present study we adopt the following assumption: the various physicochemical codes of amino acid residues are considered as masking codes, and the distinction between “code-dependent” and “code-independent” regularities is used to highlight some statistical features of amino acid patterns. The assumption derives from the fact that a well-defined set of physicochemical rules is able to unambiguously transform a given amino acid sequence into a 3D molecular structure. Hence, any physicochemical code can be consid-