Species independence of mutual information in coding and noncoding DNA

Ivo Grosse,1 Hanspeter Herzel,2 Sergey V. Buldyrev,1 and H. Eugene Stanley1
1 Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02215
2 Institute for Theoretical Biology, Humboldt University, Invalidenstrasse 43, 10115 Berlin, Germany
(Received 29 October 1999)

We explore if there exist universal statistical patterns that are different in coding and noncoding DNA and can be found in all living organisms, regardless of their phylogenetic origin. We find that (i) the mutual information function I has a significantly different functional form in coding and noncoding DNA. We further find that (ii) the probability distributions of the average mutual information $\bar{I}$ are significantly different in coding and noncoding DNA, while (iii) they are almost the same for organisms of all taxonomic classes. Surprisingly, we find that $\bar{I}$ is capable of predicting coding regions as accurately as organism-specific coding measures.

PACS number(s): 87.10.+e, 02.50.-r, 05.40.-a

I. INTRODUCTION

DNA carries the genetic information of most living organisms, and the goal of genome projects is to uncover that genetic information. Hence, genomes of many different species, ranging from simple bacteria to complex vertebrates, are currently being sequenced. As automated sequencing techniques have started to produce a rapidly growing amount of raw DNA sequences, the extraction of information from these sequences becomes a scientific challenge. A large fraction of an organism’s DNA is not used for encoding proteins [1]. Hence, one basic task in the analysis of DNA sequences is the identification of coding regions. Since biochemical techniques alone are not sufficient for identifying all coding regions in every genome, researchers from many fields have been attempting to find statistical patterns that are different in coding and noncoding DNA [2–6].
Such patterns have been found, but none seems to be species independent. Hence, traditional coding measures [7] based on these patterns need to be trained on organism-specific data sets before they can be applied to identify coding DNA. This training-set dependence limits the applicability of traditional coding measures, as many new genomes are currently being sequenced for which training sets do not exist.

II. MUTUAL INFORMATION FUNCTION

In search of species-independent statistical patterns that are different in coding and noncoding DNA, we study the mutual information function I(k), which quantifies the amount of information (in units of bits) that can be obtained from one nucleotide X about another nucleotide Y that is located k nucleotides downstream from X [8].

Within the framework of statistical mechanics, I can be interpreted as follows. Consider a compound system (X,Y) consisting of the two subsystems X and Y. Let $p_i$ denote the probability of finding system X in state i, let $q_j$ denote the probability of finding system Y in state j, and let $P_{ij}$ denote the joint probability of finding the compound system (X,Y) in state (i,j). Then the entropies of the systems X, Y, and (X,Y) are defined by

$$H[X] \equiv -k_B \sum_i p_i \ln p_i ,$$
$$H[Y] \equiv -k_B \sum_j q_j \ln q_j ,$$
$$H[X,Y] \equiv -k_B \sum_{i,j} P_{ij} \ln P_{ij} ,$$

where $k_B$ denotes the Boltzmann constant. If X and Y are statistically independent, then $H[X] + H[Y] = H[X,Y]$, which states that the Boltzmann entropy is extensive. If X and Y are statistically dependent, then the sum of the entro-

FIG. 1. Mutual information function, I(k), of human coding (thin line) and noncoding (thick line) DNA, from GenBank release 111 (Ref. [10]). We cut all human, non-mitochondrial DNA sequences into non-overlapping fragments of length 500 bp, starting at the 5'-end. We compute the mutual information function of each fragment, correct for the finite length effect (Ref. [13]), and display the average over all mutual information functions of coding and noncoding DNA separately.
We find that for noncoding DNA I(k) decays to zero as k increases, while for coding DNA I(k) shows persistent period-3 oscillations.

PHYSICAL REVIEW E, VOLUME 61, NUMBER 5, MAY 2000
1063-651X/2000/61(5)/5624(6)/$15.00 ©2000 The American Physical Society
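As a rough illustration of the quantity introduced in Sec. II, the following Python sketch estimates I(k) for a single sequence. It assumes the standard definition of mutual information, I(k) = sum over i,j of P_ij(k) log2[P_ij(k) / (p_i q_j)], with all probabilities replaced by observed frequencies. The function names `mutual_information` and `mutual_information_function` are hypothetical, and the finite-length correction mentioned in the caption of Fig. 1 is not applied, so this is not the procedure used to produce that figure.

```python
import math
from collections import Counter


def mutual_information(seq, k):
    """Plug-in estimate of I(k) in bits for one DNA sequence.

    Assumption: frequencies stand in for probabilities, and no
    finite-length correction is applied.
    """
    seq = seq.upper()
    # Collect all nucleotide pairs separated by k positions,
    # skipping pairs that contain ambiguous symbols such as N.
    pairs = [
        (seq[n], seq[n + k])
        for n in range(len(seq) - k)
        if seq[n] in "ACGT" and seq[n + k] in "ACGT"
    ]
    if not pairs:
        return 0.0
    total = len(pairs)
    joint = Counter(pairs)                 # counts for P_ij
    first = Counter(x for x, _ in pairs)   # counts for p_i
    second = Counter(y for _, y in pairs)  # counts for q_j
    info = 0.0
    for (x, y), count in joint.items():
        # P_ij / (p_i q_j) written in terms of raw counts
        ratio = count * total / (first[x] * second[y])
        info += (count / total) * math.log2(ratio)
    return info


def mutual_information_function(seq, k_max):
    """Return [I(1), ..., I(k_max)] for one sequence."""
    return [mutual_information(seq, k) for k in range(1, k_max + 1)]


if __name__ == "__main__":
    fragment = "ATGGCCAAGCTGACCGATGAGCTGAAGGCC" * 5  # made-up example
    for k, value in enumerate(mutual_information_function(fragment, 6), 1):
        print(k, round(value, 4))
```

Expressing the ratio through raw counts avoids dividing three separately rounded frequencies. To approximate the averaging described in the caption of Fig. 1, one would apply `mutual_information_function` to each 500 bp fragment and average the results, again without the finite-length correction.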