Molecular Biology, Vol. 37, No. 3, 2003, pp. 372–386. Translated from Molekulyarnaya Biologiya, Vol. 37, No. 3, 2003, pp. 436–451. Original Russian Text Copyright © 2003 by Korotkov, Korotkova, Frenkel, Kudryashov.
0026-8933/03/3703-0372 $25.00 © 2003 MAIK “Nauka / Interperiodica”

INTRODUCTION

The development of mathematical methods for studying periodicity in symbol sequences is now of considerable importance. This is primarily due to progress in sequencing various genomes and to the large number of amino acid sequences that have been determined [1–7]. The problem therefore arises of how to determine the structural characteristics of these sequences and to establish their biological meaning. One such characteristic is the periodicity of a symbol sequence.

A substantial mathematical apparatus based on Fourier analysis has been developed for studying the periodicity of continuous and discrete numerical sequences; it allows one to determine the spectral density of a sequence [8]. The same approach was subsequently applied extensively to reveal periodicity in symbol sequences. To apply the Fourier transform, however, a symbol sequence must be represented as a numerical one that unambiguously reflects its characteristics. Direct transformation of a symbol text by simply replacing symbols with numbers does not represent the symbol sequence adequately: it effectively assigns weights to the symbols and thus distorts the statistical characteristics of the original sequence.

Several approaches have been used to solve this problem [9–19]. The most widely used one represents a symbol sequence over an alphabet A = {a_1, a_2, …, a_m}, where m is the size of the alphabet, as m sequences of zeros and ones formed by the following rule: x(i, j) = 1 if symbol a_i occupies position j, and x(i, j) = 0 in all other cases.
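The zero-one representation described above can be illustrated with a minimal Python sketch. It is not the implementation used in the cited works: the function names, the naive discrete Fourier transform, the tolerance `tol`, and the toy string "ABCB" repeated six times are all assumptions introduced here for illustration only.

```python
import cmath


def indicator_sequences(text):
    """Map each symbol a_i to its 0/1 sequence: x(i, j) = 1 where a_i occupies position j."""
    alphabet = sorted(set(text))
    return {a: [1 if ch == a else 0 for ch in text] for a in alphabet}


def power_spectrum(x):
    """Naive O(N^2) discrete Fourier transform; returns {k: |X_k|^2} for k = 1..N//2."""
    n = len(x)
    spectrum = {}
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * j / n) for j, v in enumerate(x))
        spectrum[k] = abs(coeff) ** 2
    return spectrum


def fundamental_period(x, tol=1e-6):
    """Period N / k for the lowest harmonic k whose power exceeds tol (None if there is none)."""
    nonzero = [k for k, power in power_spectrum(x).items() if power > tol]
    return len(x) / min(nonzero) if nonzero else None


if __name__ == "__main__":
    toy = "ABCB" * 6  # hypothetical toy sequence, not taken from the paper
    for symbol, x in indicator_sequences(toy).items():
        print(symbol, fundamental_period(x))
```

In this toy string exactly one indicator sequence equals 1 at every position, so no symbol receives a numerical weight. The sequences for A and C yield a period of 4, whereas the sequence for B, which occurs twice per repeat, yields a period of 2.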
The Fourier transform is then applied to each of these numerical sequences, and the Fourier harmonics corresponding to symbols of type i are determined, as well as the matrix structure factors corresponding to pairwise symbol correlations [13]. The final spectral density is usually constructed taking into account the statistical characteristics of the spectral density calculated for random sequences [13].

In our opinion, this method works well enough for studying relatively short periodicity in symbol sequences (with a period length smaller than the alphabet size of the sequence). For periods longer than the alphabet size, “ransacking” of the statistical significance of the longer periods in favor of the shorter ones is possible. For example, the symbol sequence obtained by 50 repeats of YRTDFT gives five numerical sequences consisting of 0 and 1 (one for each letter of the alphabet used). In this case, the Fourier harmonics will show a period of six symbols for the letters Y, R, D, and F, but a period of three for the letter T. The statistical significance of the six-letter period is thereby reduced by the significance of the three-letter period. The greater the ratio of the period length to the alphabet size, the stronger the effect. Thus, the statistical significance of the longer period is, as it were, “smeared” over the statistical significance of the shorter periods; that is, harmonics with long periods are damped in favor of harmonics with shorter periods. The effect will be even stronger if there are substitu-

BIOINFORMATICS

The Informational Concept of Searching for Periodicity in Symbol Sequences

E. V. Korotkov¹, M. A. Korotkova², F. E. Frenkel¹, and N. A. Kudryashov²

¹ Center of Bioengineering, Russian Academy of Sciences, Moscow, 117312 Russia; E-mail: katrin2@mail.ru, katrin22@mtu-net.ru
² Moscow Engineering and Physical Institute, Moscow, 115409 Russia

Received May 25, 2002

Abstract. A method of informational decomposition has been developed that reveals hidden periodicity in any symbol sequence. The informational decomposition is calculated without converting the symbol sequence into a numerical one, which facilitates finding periodicities in the symbol sequence. The method makes it possible to introduce an analog of the autocorrelation function for a symbol sequence. The method has been applied to reveal hidden periodicities in nucleotide and amino acid sequences, as well as in various poetic texts. Hidden periodicity has been detected in various genes, testifying to their quantum structure. The functional and structural role of hidden periodicity is discussed.

Key words: autocorrelation function, hidden periodicity, symbol sequences, structure of sequences, genes and proteins

UDC 577.212.2