Species independence of mutual information in coding and noncoding DNA
Ivo Grosse,1 Hanspeter Herzel,2 Sergey V. Buldyrev,1 and H. Eugene Stanley1
1 Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02215
2 Institute for Theoretical Biology, Humboldt University, Invalidenstrasse 43, 10115 Berlin, Germany
(Received 29 October 1999)
We explore whether there exist universal statistical patterns that are different in coding and noncoding DNA and
can be found in all living organisms, regardless of their phylogenetic origin. We find that (i) the mutual
information function $I$ has a significantly different functional form in coding and noncoding DNA. We further
find that (ii) the probability distributions of the average mutual information $\bar{I}$ are significantly different in
coding and noncoding DNA, while (iii) they are almost the same for organisms of all taxonomic classes.
Surprisingly, we find that $\bar{I}$ is capable of predicting coding regions as accurately as organism-specific coding
measures.
PACS numbers: 87.10.+e, 02.50.-r, 05.40.-a
I. INTRODUCTION
DNA carries the genetic information of most living organisms, and the goal
of genome projects is to uncover that genetic information. Hence, genomes of
many different species, ranging from simple bacteria to complex vertebrates,
are currently being sequenced. As automated sequencing techniques have
started to produce a rapidly growing amount of raw DNA sequences, the
extraction of information from these sequences becomes a scientific
challenge. A large fraction of an organism's DNA is not used for encoding
proteins [1]. Hence, one basic task in the analysis of DNA sequences is the
identification of coding regions. Since biochemical techniques alone are not
sufficient for identifying all coding regions in every genome, researchers
from many fields have been attempting to find statistical patterns that are
different in coding and noncoding DNA [2–6]. Such patterns have been found,
but none seems to be species independent. Hence, traditional coding measures
[7] based on these patterns need to be trained on organism-specific data sets
before they can be applied to identify coding DNA. This training-set
dependence limits the applicability of traditional coding measures, as many
new genomes are currently being sequenced for which training sets do not
exist.
II. MUTUAL INFORMATION FUNCTION
In search of species-independent statistical patterns that are different in
coding and noncoding DNA, we study the mutual information function $I(k)$,
which quantifies the amount of information (in units of bits) that can be
obtained from one nucleotide $X$ about another nucleotide $Y$ that is located
$k$ nucleotides downstream from $X$ [8]. Within the framework of statistical
mechanics $I$ can be interpreted as follows. Consider a compound system
$(X,Y)$ consisting of the two subsystems $X$ and $Y$. Let $p_i$ denote the
probability of finding system $X$ in state $i$, let $q_j$ denote the
probability of finding system $Y$ in state $j$, and let $P_{ij}$ denote the
joint probability of finding the compound system $(X,Y)$ in state $(i,j)$.
Then the entropies of the systems $X$, $Y$, and $(X,Y)$ are defined by
$$H(X) \equiv -k_B \sum_i p_i \ln p_i,$$
$$H(Y) \equiv -k_B \sum_j q_j \ln q_j, \quad \text{and}$$
$$H(X,Y) \equiv -k_B \sum_{i,j} P_{ij} \ln P_{ij},$$
where $k_B$ denotes the Boltzmann constant. If $X$ and $Y$ are
statistically independent, then $H(X) + H(Y) = H(X,Y)$,
which states that the Boltzmann entropy is extensive. If X
and Y are statistically dependent, then the sum of the entro-
FIG. 1. Mutual information function, $I(k)$, of human coding (thin line)
and noncoding (thick line) DNA, from GenBank release 111 (Ref. [10]). We cut
all human, non-mitochondrial DNA sequences into non-overlapping fragments of
length 500 bp, starting at the 5'-end. We compute the mutual information
function of each fragment, correct for the finite length effect (Ref. [13]),
and display the average over all mutual information functions of coding and
noncoding DNA separately. We find that for noncoding DNA $I(k)$ decays to
zero as $k$ increases, while for coding DNA $I(k)$ shows persistent period-3
oscillations.
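
The procedure described in the caption of Fig. 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation. It assumes the standard definition of mutual information in bits, $I(k) = \sum_{i,j} P_{ij}(k) \log_2 [P_{ij}(k) / (p_i q_j)]$, with all probabilities estimated as relative frequencies within a fragment. The function names (`mutual_information`, `fragments`, `mean_mutual_information`) are hypothetical, the finite-length correction of Ref. [13] is not applied, and dropping an incomplete trailing fragment is our own assumption.

```python
import math
from collections import Counter


def mutual_information(seq, k):
    """Estimate I(k) in bits between symbols k positions apart in seq.

    Probabilities are plain relative frequencies; no finite-length
    correction is applied (assumption: the raw plug-in estimate).
    """
    n = len(seq) - k  # number of (X, Y) pairs at distance k
    if k < 1 or n <= 0:
        raise ValueError("k must satisfy 1 <= k < len(seq)")
    joint = Counter(zip(seq[:n], seq[k:]))  # counts for P_ij
    left = Counter(seq[:n])                 # counts for p_i (upstream X)
    right = Counter(seq[k:])                # counts for q_j (downstream Y)
    total = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # P_ij / (p_i * q_j), written in terms of raw counts
        total += p_ab * math.log2(p_ab * n * n / (left[a] * right[b]))
    return total


def fragments(seq, length=500):
    """Cut seq into non-overlapping fragments, starting at the 5' end.

    Assumption: an incomplete trailing fragment is discarded.
    """
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, length)]


def mean_mutual_information(seqs, k_max, length=500):
    """Average I(k), k = 1..k_max, over all fragments of all sequences."""
    frags = [f for s in seqs for f in fragments(s, length)]
    if not frags:
        raise ValueError("no fragment of the requested length available")
    return [
        sum(mutual_information(f, k) for f in frags) / len(frags)
        for k in range(1, k_max + 1)
    ]
```

Calling `mean_mutual_information` separately on a list of coding sequences and on a list of noncoding sequences would give two averaged curves of the kind compared in Fig. 1, up to the omitted finite-length correction.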
PHYSICAL REVIEW E MAY 2000 VOLUME 61, NUMBER 5
PRE 61 1063-651X/2000/61(5)/5624(6)/$15.00 5624 ©2000 The American Physical Society