An evolutionary model for the origin of non-randomness, long-range order and fractality in the genome

Yannis Almirantis 1* and Astero Provata 2

Summary

We present a model for genome evolution comprising biologically plausible events such as transpositions inside the genome and insertions of exogenous sequences. The model attempts to formulate a minimal proposition accounting for key statistical properties of genomes while avoiding, as far as possible, unsupportable hypotheses about the remote evolutionary past. The statistical properties that are observed in genomic sequences and are reproduced by the proposed model are: (i) deviations from randomness at different length scales, measured by suitable algorithms, (ii) a special form of size distribution (power law distribution) characterising different levels of genome organisation in the non-coding regions, and (iii) extensive resemblance in the alternation of coding and non-coding regions at several length scales (self-similarity) in long genomic sequences of higher eukaryotes. BioEssays 23:647–656, 2001. © 2001 John Wiley & Sons, Inc.

Introduction

The recent developments in molecular biology, leading to the determination of entire genome sequences, have been paralleled by systematic investigation of the statistical and probabilistic aspects of genome organisation. Considerable effort was concentrated at first on the search for systematic differences between coding and non-coding sequences. At this level of organisation, the principal factor affecting the statistics of the "biological text", written in the four-letter alphabet of the nucleotides, is the use of the "grammar and syntax" of the triplet code. Several algorithms based on oligonucleotide statistics, codon usage, etc., were devel-

1 Institute of Biology, National Research Centre for Physical Sciences "Demokritos", Athens, Greece.
2 Institute of Physical Chemistry, National Research Centre for Physical Sciences "Demokritos", Athens, Greece.
*Correspondence to: Yannis Almirantis, Institute of Biology, National Research Centre for Physical Sciences "Demokritos", 15310 Athens, Greece. E-mail: yalmir@mail.demokritos.gr

Box 1: Definitions

Random processes: the output of die- or coin-tossing experiments and any process that can be put in one-to-one correspondence with them. For the purposes of our analysis it is useful to list here the following immediate implications of plain randomness for a symbol sequence (even if the symbols involved are not equiprobable):
* In a random sequence, the probability of finding a symbol at a given position does not depend on the previous symbols.
* The size distributions of similar-symbol clusters are exponentially decaying in random sequences.

Non-random processes: any process that deviates non-trivially from the above.

Detrending: the procedure of filtering only specific length scales of interest and ignoring larger and/or shorter length scale features.

Long-range correlations (LRC). Correlations are the result of interactions between different constituents of a system. When these interactions extend across the entire system, the correlations are called long-range.

Power law distributions. LRC often characterise symbol sequences with over-represented long tracks (clusters) of similar symbols. More rigorously, an indication of LRC is linearity, on a double logarithmic scale, of the cluster size distribution of similar symbols over some range of length scales. This is the so-called power law distribution.

Fractal: an object whose characteristic features grow (scale) with the object's size with a power D_f smaller than the spatial dimensionality of the object.

Self-similar: any object whose statistical properties are independent of the observation scale. Intuitively, self-similar objects resemble themselves when seen at different magnifications.

Problems and paradigms
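The Box 1 statement that similar-symbol clusters have exponentially decaying sizes in a random sequence can be checked numerically. The sketch below is only an illustration under stated assumptions, not the authors' algorithm: the function names (`cluster_sizes`, `random_sequence`, `semilog_slope`), the two-symbol alphabet and the `min_count` cut-off are all hypothetical choices. For independent symbols with probability p, runs of that symbol have the geometric size distribution P(s) = (1 - p) p^(s - 1), so the logarithm of the counts should be linear in s with slope ln p. A power law distribution would instead appear linear when both axes are logarithmic.

```python
import math
import random
from collections import Counter


def cluster_sizes(seq, symbol=None):
    """Lengths of maximal runs of identical symbols in seq.

    If symbol is given, only runs made of that symbol are reported.
    """
    sizes = []
    i, n = 0, len(seq)
    while i < n:
        j = i
        while j < n and seq[j] == seq[i]:
            j += 1
        if symbol is None or seq[i] == symbol:
            sizes.append(j - i)
        i = j
    return sizes


def random_sequence(n, p, seed=0):
    """Sequence of n independent symbols; 1 with probability p, else 0."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]


def semilog_slope(sizes, min_count=100):
    """Least-squares slope of ln(count) against cluster size.

    Sizes observed fewer than min_count times are dropped (an
    illustrative cut-off) to limit noise from the sparse tail.
    """
    hist = Counter(sizes)
    pts = [(s, math.log(c)) for s, c in sorted(hist.items()) if c >= min_count]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    return sxy / sxx


if __name__ == "__main__":
    p = 0.5
    seq = random_sequence(200_000, p, seed=1)
    slope = semilog_slope(cluster_sizes(seq, symbol=1))
    print("fitted slope:", slope, "expected ln p:", math.log(p))
```

A fitted slope close to ln p on the semi-logarithmic scale is the signature of plain randomness; a systematic excess of long clusters relative to this line is the kind of deviation the definitions above associate with long-range correlations.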