Computational Biology and Chemistry 53 (2014) 15–25
Bacterial genomes lacking long-range correlations may not be
modeled by low-order Markov chains: The role of mixing statistics
and frame shift of neighboring genes
Germinal Cocho a, Pedro Miramontes b,*, Ricardo Mansilla c, Wentian Li d,*
a Departamento de Sistemas Complejos, Instituto de Física, Universidad Nacional Autónoma de México, Ciudad Universitaria, México 04510, DF, Mexico
b Facultad de Ciencias, Universidad Nacional Autónoma de México, Ciudad Universitaria, México 04510, DF, Mexico
c Centro de Investigaciones Interdisciplinarias en Ciencias y Humanidades, Universidad Nacional Autónoma de México, Ciudad Universitaria, México 04510, DF, Mexico
d The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, Manhasset, NY, USA
article info
Article history:
Available online 30 August 2014
Keywords:
Bacterial genomes
Exponential correlation function
Markov model
Second largest eigenvalue
Hexamer
Periodicity of 10–11 bases
Heterogeneity
Codon positions
abstract
We examine in detail the relationship between exponential correlation functions and Markov models in a
bacterial genome. Despite the well-known fact that Markov models generate sequences whose correlation
functions decay exponentially, simply constructed Markov models based on nearest-neighbor dimers
(first-order), trimers (second-order), and so on up to hexamers (fifth-order), which treat the DNA sequence
as homogeneous, all fail to predict the value of the exponential decay rate. Even reading-frame-specific
Markov models (both first- and fifth-order) cannot explain the fact that the exponential decay is very
slow. Starting with the in-phase coding DNA sequences (CDSs), we investigated the correlation within a
fixed-codon-position subsequence and in artificially constructed sequences obtained by packing CDSs with
out-of-phase spacers, as well as by altering the CDS length distribution through an imposed upper limit.
From these targeted analyses, we conclude that the correlation in the bacterial genomic sequence is mainly
due to a mixing of heterogeneous statistics at different codon positions, and that the decay of the
correlation is due to neighboring CDSs possibly being out of phase with each other. There are also small
contributions to the correlation from bases at the same codon position, as well as from non-coding
sequences. These results show that the seemingly simple exponential correlation functions in a bacterial
genome hide a complexity in correlation structure that is not suitable for modeling by a Markov chain on a
homogeneous sequence. Other results include the use of the second largest eigenvalue (in absolute value)
to represent the 16 correlation functions and the prediction of a 10–11 base periodicity from the hexamer
frequencies.
© 2014 Elsevier Ltd. All rights reserved.
1. Introduction
Long-range correlations usually refer to a power-law correlation
function, whereas short-range correlations refer to an exponential
correlation function. Many genomes, when a chromosome is
treated as a sequence of symbols or numerical values, exhibit
power-law long-range correlations (Li, 1997a; Buldyrev, 2006;
Arneodo et al., 2011). More interestingly, the type of long-range
correlations in genomes shares similarities with “1/f noise” time
series (Li and Kaneko, 1992; Voss, 1992; Li et al., 1998; Li and Holste,
2005). Not all genomes exhibit power-law correlation functions,
however: bacterial genomes tend to exhibit 1/f² spectra (Li,
1997b) and exponential correlation functions (Bernaola-Galván
et al., 2002).
* Corresponding authors.
E-mail addresses: pmv@ciencias.unam.mx (P. Miramontes),
wtli2012@gmail.com (W. Li).
There are many mathematical models of sequences with power-law
correlations (Beran, 1994; Beran et al., 2014). Although there
have been attempts to propose a universal framework for all observed
power laws (Peterson et al., 2013), a mechanistic model of
any specific dataset with power-law distributions may be
non-universal and not applicable to other datasets (Sornette, 2006).
For example, many long-range correlations in complex genomes
may be caused by large domains with different base compositions,
whose sizes follow a broad or even long-tailed distribution
(Bernaola-Galván et al., 1996; Clay et al., 2001).
The range of mathematical models of sequences with exponential
correlation functions, on the other hand, is relatively narrow.
Markov chains are almost always used as the generating model.
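The link between Markov chains and exponential correlation functions can be illustrated with a toy first-order chain over the four bases. The sketch below is not the authors' procedure. The transition matrix P = r·I + (1 − r)·1qᵀ, the base composition q, and the value r = 0.4 are hypothetical choices, picked because the eigenvalues of this P are known in closed form (1, and r with multiplicity three). The code uses the standard expression Γ_ab(d) = π_a (P^d)_ab − π_a π_b for the correlation function of a stationary chain and checks that successive values shrink by a constant factor equal to the second largest eigenvalue.

```python
# Toy illustration only: Q and R are hypothetical, not estimated from any genome.

BASES = "ACGT"
Q = [0.3, 0.2, 0.2, 0.3]   # assumed stationary base composition
R = 0.4                    # assumed second largest eigenvalue of P

# P = R*I + (1-R)*1*Q^T; every row sums to 1 and Q is its stationary vector.
P = [[R * (i == j) + (1 - R) * Q[j] for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    """Plain-Python product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def stationary(p, iters=1000):
    """Stationary distribution obtained by iterating pi <- pi P."""
    n = len(p)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
    return pi


def correlation(p, a, b, max_d):
    """Gamma_ab(d) = pi_a * (P^d)_ab - pi_a * pi_b for d = 1..max_d."""
    pi = stationary(p)
    gammas = []
    p_d = [row[:] for row in p]          # holds P^d, starting at d = 1
    for _ in range(max_d):
        gammas.append(pi[a] * p_d[a][b] - pi[a] * pi[b])
        p_d = mat_mul(p_d, p)
    return gammas


def decay_ratios(gammas):
    """Ratios Gamma(d+1)/Gamma(d); constant for a purely exponential decay."""
    return [g1 / g0 for g0, g1 in zip(gammas, gammas[1:])]


if __name__ == "__main__":
    gam = correlation(P, BASES.index("A"), BASES.index("A"), 8)
    for d, g in enumerate(gam, start=1):
        print(f"d={d}  Gamma_AA={g:.6e}")
    print("ratios:", [round(x, 6) for x in decay_ratios(gam)])
```

For a general transition matrix the ratios are not constant from d = 1 onward but typically settle toward the second largest eigenvalue (in absolute value) as d grows, so the decay rate of the correlation function is fixed by the model's spectrum. This is why a simple Markov model estimated from short oligomer frequencies makes a definite, testable prediction for the decay rate.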
http://dx.doi.org/10.1016/j.compbiolchem.2014.08.005
1476-9271/© 2014 Elsevier Ltd. All rights reserved.