0026-8933/03/3703- $25.00 © 2003 MAIK “Nauka / Interperiodica” 0372
Molecular Biology, Vol. 37, No. 3, 2003, pp. 372–386. Translated from Molekulyarnaya Biologiya, Vol. 37, No. 3, 2003, pp. 436–451.
Original Russian Text Copyright © 2003 by Korotkov, Korotkova, Frenkel, Kudryashov.
INTRODUCTION
The development of mathematical methods for studying
periodicities in symbol sequences has become increasingly
important. This is primarily due to progress in
sequencing various genomes and to the accumulation of
a great number of determined amino acid sequences [1–7].
This raises the problem of how to determine the structural
characteristics of these sequences and establish their
biological meaning. One such characteristic is the periodicity
of symbol sequences.
To study the periodicity of continuous and discrete
numerical sequences, a substantial body of mathematics
has been developed on the basis of Fourier analysis,
which allows one to determine the spectral density of a
sequence [8]. The same approach was subsequently applied
extensively to reveal periodicity in symbol sequences.
However, to apply the Fourier transform, one must
represent a symbol sequence as a numerical one that
unambiguously reflects its characteristics. Direct
transformation of a symbol text by simply replacing symbols
with numbers does not represent the symbol sequence
adequately: it in effect assigns weights to the symbols,
which distorts the statistical characteristics of
the initial symbol sequence. Several approaches have
been applied to solve the problem [9–19]. The most widely
used one represents a symbol sequence over an
alphabet $A = \{a_1, a_2, \ldots, a_m\}$, where $m$ is the size of the
alphabet, as $m$ sequences of zeros and ones formed according to
the following rule: $x(i, j) = 1$ if symbol $a_i$ occupies position $j$,
and $x(i, j) = 0$ in all other cases. Then the Fourier
transform is applied to each of these numerical
sequences, and the Fourier harmonics corresponding to
symbols of type i are determined, as well as the matrix
structural factors corresponding to pairwise symbol
correlations [13]. The final spectral density is usually
built taking into account the statistical characteristics
of the spectral density calculated for random sequences [13].
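The indicator-sequence construction described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in [13]: the function names `indicator_sequences` and `power_spectrum` are hypothetical, the discrete Fourier transform is computed directly from its definition, and no normalization against random sequences or pairwise structural factors is included.

```python
import cmath

def indicator_sequences(text):
    """Map a symbol sequence to one 0/1 sequence per distinct symbol.

    x[a][j] = 1 if symbol a occupies position j, and 0 otherwise.
    (Hypothetical helper, named here for illustration only.)
    """
    alphabet = sorted(set(text))
    return {a: [1 if s == a else 0 for s in text] for a in alphabet}

def power_spectrum(x):
    """Squared magnitude of the discrete Fourier transform of x.

    Computed naively from the definition in O(N^2) time.
    """
    n = len(x)
    spectrum = []
    for k in range(n):
        coeff = sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n)
                    for j in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

# Toy sequence with an exact period of four symbols (an assumed example).
seq = "ACGTACGTACGT"
x = indicator_sequences(seq)
spec_a = power_spectrum(x["A"])
```

For this toy sequence of length 12, the indicator sequence of A has power only at harmonics that are multiples of 12/4 = 3, which corresponds to the period of four symbols.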
In our opinion, this method works well enough for
studying relatively short periodicities in symbol
sequences (with period length smaller than the alphabet
size). For periods longer than the alphabet size, part of
the statistical significance of the longer periods may be
transferred to the shorter ones. For example, the symbol
sequence obtained by 50 repeats of YRTDFT yields
five numerical sequences consisting of 0 and 1
(one for each letter of the alphabet used). In this case, the
Fourier harmonics will show a period of six symbols
for letters Y, R, D, and F, but a period of three for letter T.
As a result, the statistical significance of the six-letter
period is reduced by the significance of
the three-letter period. The greater the ratio of the period
length to the alphabet size, the stronger the effect. Thus,
the statistical significance of the longer period is, in
effect, “smeared” over that of the shorter periods; i.e.,
harmonics with long periods are damped
in favor of harmonics with shorter periods.
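The YRTDFT example can be checked numerically. The sketch below is an illustration under our own assumptions (the helper `harmonic_power` is hypothetical and the harmonic power is left unnormalized): for each letter it evaluates the power of the indicator sequence at the harmonic of the six-symbol period and at the harmonic of the three-symbol period.

```python
import cmath

def harmonic_power(x, k):
    """Squared magnitude of the k-th DFT coefficient of x (hypothetical helper)."""
    n = len(x)
    coeff = sum(v * cmath.exp(-2j * cmath.pi * k * j / n)
                for j, v in enumerate(x))
    return abs(coeff) ** 2

seq = "YRTDFT" * 50          # 300 symbols, as in the example in the text
n = len(seq)

results = {}
for a in sorted(set(seq)):
    x = [1 if s == a else 0 for s in seq]   # indicator sequence of letter a
    p6 = harmonic_power(x, n // 6)          # harmonic of the 6-symbol period
    p3 = harmonic_power(x, n // 3)          # harmonic of the 3-symbol period
    results[a] = (p6, p3)

for a, (p6, p3) in results.items():
    print(a, round(p6), round(p3))
```

In this sketch the letters D, F, R, and Y have equal power (2500) at both harmonics, whereas T has essentially zero power at the six-symbol harmonic and 10000 at the three-symbol harmonic, so T contributes nothing to the evidence for the six-letter period.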
The effect will be even stronger if there are substitu-
BIOINFORMATICS
The Informational Concept of Searching for Periodicity in Symbol Sequences
E. V. Korotkov¹, M. A. Korotkova², F. E. Frenkel¹, and N. A. Kudryashov²
¹ Center of Bioengineering, Russian Academy of Sciences, Moscow, 117312 Russia;
E-mail: katrin2@mail.ru, katrin22@mtu-net.ru
² Moscow Engineering and Physical Institute, Moscow, 115409 Russia
Received May 25, 2002
Abstract—A method of informational decomposition has been developed that reveals hidden
periodicity in any symbol sequence. The informational decomposition is calculated without converting the symbol
sequence into a numerical one, which facilitates finding periodicities in it. The method also permits
introducing an analog of the autocorrelation function for a symbol sequence. The method has
been applied to reveal hidden periodicities in nucleotide and amino acid sequences, as well as in various
poetic texts. Hidden periodicity has been detected in various genes, testifying to their quantum structure. The
functional and structural role of hidden periodicity is discussed.
Key words: autocorrelation function, hidden periodicity, symbol sequences, structure of sequences, genes and
proteins
UDC 577.212.2