Computer Physics Communications 121–122 (1999) 136–138
www.elsevier.nl/locate/cpc

Compositional complexity of DNA sequence models

P. Bernaola-Galván a,1, P. Carpena a, R. Román-Roldán b, J.L. Oliver c

a Departamento de Física Aplicada II, Universidad de Málaga, Málaga, Spain
b Departamento de Física Aplicada, Universidad de Granada, Granada, Spain
c Departamento de Genética e Instituto de Biocomputación, Universidad de Granada, Granada, Spain

1 E-mail: rick@ctima.uma.es.

Abstract

Recently, we proposed a new measure of complexity for symbolic sequences (Sequence Compositional Complexity, SCC) based on the entropic segmentation of a sequence into compositionally homogeneous domains. Such segmentation is carried out by means of a conceptually simple, computationally efficient heuristic algorithm. SCC is now applied to the sequences generated by several stochastic models which describe the statistical properties of DNA, in particular the observed long-range fractal correlations. This approach allows us to test the capability of the different models in describing the complex compositional heterogeneity found in DNA sequences. Moreover, SCC detects clear differences where conventional standard methods fail. © 1999 Elsevier Science B.V. All rights reserved.

1. Introduction

DNA sequences are formed by patches or domains of different nucleotide composition; given the huge spatial heterogeneity of most genomes, the identification of compositional patches or domains in a sequence is a critical step in understanding large-scale genome structure. Moreover, in sequences from higher organisms, these domains are organized in very complex structures (with fractal properties in many cases), and therefore domains need to be defined on a statistical basis.

2. Sequence compositional complexity

To obtain the partition of a given sequence into domains we proposed a segmentation method, based on the Jensen–Shannon entropic divergence (JS_m) [1].
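As a concrete illustration of this divergence, which is defined formally in Eq. (1) below, the following Python sketch computes the entropy of a whole sequence minus the length-weighted entropies of its segments. The function names and the `cuts` argument are hypothetical, chosen only for this example. The exhaustive search in `best_single_cut` handles only the two-segment case, and it is neither the heuristic algorithm mentioned in the abstract nor a treatment of the significance level.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the symbol frequencies in seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())


def js_divergence(seq, cuts):
    """Entropy of seq minus the length-weighted entropies of its segments.

    cuts (hypothetical argument name) holds the interior cut positions;
    m - 1 cuts split the sequence into m segments.
    """
    bounds = [0, *sorted(cuts), len(seq)]
    segments = [seq[a:b] for a, b in zip(bounds, bounds[1:])]
    weighted = sum(len(s) / len(seq) * shannon_entropy(s) for s in segments)
    return shannon_entropy(seq) - weighted


def best_single_cut(seq):
    """Brute-force search for the single cut (m = 2) maximizing the divergence."""
    return max(range(1, len(seq)), key=lambda i: js_divergence(seq, [i]))
```

For a toy sequence such as "AAAATTTT", cutting at position 4 separates two perfectly homogeneous segments, so the divergence equals the full entropy of the sequence (1 bit); any cut of a homogeneous sequence such as "AAAA" gives zero.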
We search for the partition that maximizes JS_m, defined as:

\[
JS_m = H[S] - \sum_{i=1}^{m} \frac{l_i}{L} \, H[S_i], \qquad (1)
\]

where H[S] is the Shannon entropy of the sequence of length L, and H[S_i] is the Shannon entropy of the i-th segment of length l_i. As the segmentation is carried out by means of a statistical criterion, a significance level (s) must be established, so the final result depends critically on this parameter. If s is close to 100%, a small number of domains is obtained, but with very significant differences between them; conversely, if s is lower, the number of domains increases but the differences between them are less significant. In other words, for high values of s only the large-scale details of the sequence are revealed, whereas lowering s lets the small-scale structure of the sequence emerge.

0010-4655/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved. PII:S0010-4655(99)00298-2

Since searching for the partition that maximizes (1) requires the solution of an NP-complete prob-