symmetry
Article
Mathematical Algorithm for Identification of Eukaryotic
Promoter Sequences
Eugene V. Korotkov 1,*, Yulia M. Suvorova 1, Anna V. Nezhdanova 1, Sofia E. Gaidukova 1, Irina V. Yakovleva 1,
Anastasia M. Kamionskaya 1 and Maria A. Korotkova 2
Citation: Korotkov, E.V.; Suvorova,
Y.M.; Nezhdanova, A.V.; Gaidukova,
S.E.; Yakovleva, I.V.; Kamionskaya,
A.M.; Korotkova, M.A. Mathematical
Algorithm for Identification of
Eukaryotic Promoter Sequences.
Symmetry 2021, 13, 917.
https://doi.org/10.3390/sym13060917
Academic Editor: Laura Pop
Received: 19 April 2021
Accepted: 18 May 2021
Published: 21 May 2021
Publisher’s Note: MDPI stays neutral
with regard to jurisdictional claims in
published maps and institutional affiliations.
Copyright: © 2021 by the authors.
Licensee MDPI, Basel, Switzerland.
This article is an open access article
distributed under the terms and
conditions of the Creative Commons
Attribution (CC BY) license
(https://creativecommons.org/licenses/by/4.0/).
1 Institute of Bioengineering, Federal Research Center of Biotechnology of the Russian Academy of Sciences,
119071 Moscow, Russia; suvorovay@biengi.ac.ru (Y.M.S.); anna-negdanova@mail.ru (A.V.N.);
plasmid@yandex.ru (S.E.G.); iacgea@biengi.ac.ru (I.V.Y.); akatio@biengi.ac.ru (A.M.K.)
2 Institute of Cyber Intelligence Systems, National Research Nuclear University MEPhI (Moscow Engineering
Physics Institute), 115409 Moscow, Russia; makorotkova@mephi.ru
* Correspondence: katrin2@biengi.ac.ru; Tel.: +79-26-724-8271
Abstract: Identification of promoter sequences in eukaryotic genomes by computational methods
is an important task in bioinformatics. However, this problem remains unsolved because the best
algorithms have a false positive probability of 10⁻³–10⁻⁴ per nucleotide. As a result, a full-genome
analysis may yield more false positives than there are annotated gene promoters. The probability of a
false positive should be reduced to 10⁻⁶–10⁻⁸ to decrease the number of false positives and increase
the reliability of the prediction. A method for multiple alignment of promoter sequences was
developed. Mathematical methods were then developed for calculating statistically significant
classes of promoter sequences. Five promoter classes were created from the rice genome. We
used these classes to search for potential promoter sequences in the rice genome with a
false positive rate of less than 10⁻⁸ per nucleotide. The five classes of promoter sequences contain
1740, 222, 199, 167 and 130 promoters, respectively. A total of 145,277 potential promoter sequences
(PPSs) were identified. Of these, 18,563 are promoters of known genes, 87,233 PPSs intersect with
transposable elements, and 37,390 PPSs were found in previously unannotated sequences. The
false positive rate for a randomly shuffled rice genome is less than 10⁻⁸ per nucleotide.
The method developed for detecting PPSs was compared with some previously used approaches.
The developed mathematical method can be used to search for genes, transposable elements, and
transcription start sites in eukaryotic genomes.
Keywords: promoter; rice genome; dynamic programming; base correlation
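The false-positive argument in the abstract is simple arithmetic: the expected number of false hits is the per-nucleotide rate multiplied by the number of positions scanned. A minimal Python sketch follows. It assumes a rice genome length on the order of 4 × 10^8 nucleotides (a round figure chosen for illustration, not a value from the article) and uses a hypothetical helper name, `expected_false_positives`.

```python
# Illustrative sketch only. RICE_GENOME_NT is an assumed order of magnitude,
# and expected_false_positives is a hypothetical helper, not the authors' code.

def expected_false_positives(rate_per_nt: float, genome_length: int, strands: int = 1) -> float:
    """Expected false hits when every position is tested once per strand."""
    return rate_per_nt * genome_length * strands


RICE_GENOME_NT = 4 * 10**8  # assumption: approximate rice genome size in nucleotides

if __name__ == "__main__":
    # Rates quoted in the abstract: current best methods vs. the stated target range.
    for rate in (1e-3, 1e-4, 1e-6, 1e-8):
        n = expected_false_positives(rate, RICE_GENOME_NT)
        print(f"rate {rate:.0e} per nt: about {n:,.0f} expected false positives")
```

Under this assumed genome length, rates of 10⁻³–10⁻⁴ give tens to hundreds of thousands of expected false hits, whereas a rate of 10⁻⁸ gives only a handful.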
1. Introduction
The promoter sequences of both prokaryotes and eukaryotes are located upstream of the
point of transcription initiation [1]. The site on the DNA from which the first RNA nucleotide
is transcribed is called the +1 site. The so-called core promoter, 60–120 bases long,
is the DNA region to which RNA polymerase binds [2,3]. A longer
stretch of 600 bases, from -499 to +100, includes the core promoter as well as the binding sites
of various transcription factors [4]. “Further, we will focus only on eukaryotic promoter
sequences. The promoter includes some motifs, which are short conservative sequences.
The so-called TATA sequence is known, which occupies positions from -31–-26 nucleotides [5].
Additionally, the B recognition element is known, which is between -37 and
-32 nucleotides in the promoter sequence. Short sequences have been found that provide
binding of various protein factors to the promoter sequence [6]. Many of these sequences
fall on the promoter region from +1–+40. The promoter sequence is not symmetrical” [7],
which makes RNA polymerase begin transcription in the correct direction.
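The coordinates in this paragraph can be turned into string indices with a little bookkeeping. The Python sketch below is illustrative only, and all function names are hypothetical. The passage does not state its counting convention, so the sketch makes two assumptions: a position 0 is counted, since that is the convention under which -499 to +100 comes to exactly 600 bases, and the B recognition element spans -37 to -32, reading the second bound as negative. The `has_zero` flag shows how the numbers change if no position 0 exists.

```python
# Illustrative sketch; the has_zero=True default is an assumption chosen so that
# the window -499..+100 contains 600 bases, as stated in the text.

WINDOW_START = -499
WINDOW_END = 100


def span_length(first: int, last: int, has_zero: bool = True) -> int:
    """Number of bases in the inclusive coordinate range first..last."""
    if first > last:
        raise ValueError("first must not exceed last")
    if not has_zero and 0 in (first, last):
        raise ValueError("position 0 does not exist under this convention")
    n = last - first + 1
    # Without a position 0, a range crossing from negative to positive is one base shorter.
    if not has_zero and first < 0 < last:
        n -= 1
    return n


def to_index(pos: int, start: int = WINDOW_START, has_zero: bool = True) -> int:
    """Map a promoter coordinate to a 0-based index into the window string."""
    if not has_zero and pos == 0:
        raise ValueError("position 0 does not exist under this convention")
    idx = pos - start
    if not has_zero and start < 0 < pos:
        idx -= 1  # skip the missing position 0
    return idx


def extract(window: str, first: int, last: int, has_zero: bool = True) -> str:
    """Return the bases at coordinates first..last (inclusive) from the window."""
    return window[to_index(first, has_zero=has_zero): to_index(last, has_zero=has_zero) + 1]


if __name__ == "__main__":
    print(span_length(WINDOW_START, WINDOW_END))                  # window length with a position 0
    print(span_length(WINDOW_START, WINDOW_END, has_zero=False))  # same range without a position 0
    print(span_length(-31, -26), span_length(-37, -32))           # TATA and B recognition element spans
```

Both motifs span six bases under either convention, because neither range crosses the transcription start site. Only ranges that straddle it, such as the full window, depend on how position 0 is treated.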
Promoter sequences differ greatly from one another [8–10]. This results from
the need to control the transcription of various genes. When transcription is initiated, the