Are isochore sequences homogeneous?

Wentian Li *

Center for Genomics and Human Genetics, North Shore – LIJ Research Institute, 350 Community Drive, Manhasset, NY 11030, USA

Received 21 December 2001; received in revised form 13 May 2002; accepted 17 July 2002

Abstract

Three statistical/mathematical analyses are carried out on isochore sequences: spectral analysis, analysis of variance, and segmentation analysis. Spectral analysis shows that there are GC content fluctuations at different length scales in isochore sequences. The analysis of variance shows that the null hypothesis (the mean value of a group of GC contents remains the same along the sequence) may or may not be rejected for an isochore sequence, depending on the subwindow sizes at which GC contents are sampled, and the window size within which group members are defined. The segmentation analysis shows that there are stronger indications of GC content changes at isochore borders than within an isochore. These analyses support the notion of isochore sequences, but reject the assumption that isochore sequences are homogeneous at the base level. An isochore sequence may pass a homogeneity test when GC content fluctuations at smaller length scales are ignored or averaged out. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Isochore sequences; Homogeneous; Statistical/mathematical analyses

1. Introduction

In the paper on the human genome draft sequence (Lander et al., 2001), the following comments were made on long-range variation of GC content in DNA sequences and on the topic of ‘isochores’ (Macaya et al., 1976; Cuny et al., 1981; Bernardi, 1995): “We studied the draft genome sequence to see whether strict isochores could be identified. For example, the sequence was divided into 300-kb windows, and each window was subdivided into 20-kb subwindows.
We calculated the average GC content for each window and subwindow, and investigated how much of the variance in the GC content of subwindows across the genome can be statistically ‘explained’ by the average GC content in each window. About three-quarters of the genome-wide variance among 20-kb windows can be statistically explained by the average GC content of 300-kb windows that contain them, but the residual variance among subwindows (standard deviation, 2.4%) is still too large to be consistent with a homogeneous distribution. In fact, the hypothesis of homogeneity could be rejected for each 300-kb window in the draft genome sequence. …These results rule out a strict notion of isochores as compositionally homogeneous. Instead, there is a substantial variation at many different scales, … Although isochores do not appear to merit the prefix ‘iso’, the genome clearly does contain large regions of distinctive GC content and it is likely to be worth redefining the concept so that it becomes possible rigorously to partition the genome into regions” (p. 877 of Lander et al., 2001).

Several sentences in the above paragraph need further examination. Besides the inappropriate test used in Lander et al. (2001) (see Section 4.1), the discussion of the variances of GC content in windows and subwindows of particular sizes (300 and 20 kb) raises questions. As discussed in Li (2001c), the concept of homogeneity is relative: it depends not only on the stringency of the criterion, but also on the length scale at which GC contents are examined. It is natural to ask whether other choices of the window and subwindow sizes would change the conclusion concerning homogeneity. Sometimes, short segments of a DNA sequence (e.g. 1 kb) with extremely high or low GC content are the reason the sequence fails a homogeneity test. Nevertheless, these segments are much shorter than the sequence being examined, and might be ignored.
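The window and subwindow comparison quoted above can be made concrete with a small sketch. The Python code below is only an illustration under stated assumptions, not the procedure of Lander et al. (2001): it computes the share of the variance among subwindow GC contents that is accounted for by the window means (one minus the ratio of the within-window sum of squares to the total sum of squares) and does not perform a significance test. The function names, the toy two-region sequence, and the scaled-down window sizes are all hypothetical choices made for the example.

```python
import random


def gc_content(seq):
    """Return the fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def random_sequence(length, gc_prob, rng):
    """Toy generator (assumption): each base is G/C with probability gc_prob, else A/T."""
    return "".join(
        rng.choice("GC") if rng.random() < gc_prob else rng.choice("AT")
        for _ in range(length)
    )


def variance_explained(seq, window, subwindow):
    """Fraction of the variance among subwindow GC contents explained by window means.

    Computed as 1 - (within-window sum of squares) / (total sum of squares).
    The window size must be a multiple of the subwindow size; a trailing
    partial window is dropped.
    """
    if window % subwindow != 0:
        raise ValueError("window must be a multiple of subwindow")
    if len(seq) < window:
        raise ValueError("sequence is shorter than one window")
    per_window = window // subwindow
    groups = []
    for start in range(0, len(seq) - window + 1, window):
        groups.append([
            gc_content(seq[start + k * subwindow:start + (k + 1) * subwindow])
            for k in range(per_window)
        ])
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    within_ss = 0.0
    for group in groups:
        mean = sum(group) / len(group)
        within_ss += sum((v - mean) ** 2 for v in group)
    return 1.0 - within_ss / total_ss if total_ss > 0 else 0.0


# Toy sequence with two regions of different GC content (illustrative values only).
rng = random.Random(0)
toy = random_sequence(30_000, 0.40, rng) + random_sequence(30_000, 0.55, rng)

# Sizes are scaled down from the 300-kb / 20-kb choice to keep the example fast.
for window, subwindow in [(3000, 200), (3000, 1000), (15000, 3000)]:
    fraction = variance_explained(toy, window, subwindow)
    print(window, subwindow, round(fraction, 3))
```

Running the loop with different pairs of sizes shows how the explained fraction depends on the length scales chosen, which is the question raised in the text about the 300-kb and 20-kb choice.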
With these short-scale fluctuations of base composition averaged out (‘coarse graining’, to borrow a term from statistical physics, which specializes in connecting the microscopic and the macroscopic worlds), can a sequence claimed to be heterogeneous become homogeneous?

* Tel.: +1-516-562-1076; fax: +1-516-562-1683. E-mail address: wli@linkage.rockefeller.edu (W. Li), wli@nshs.edu (W. Li).

Abbreviations: ANOVA, analysis of variance; BIC, Bayesian information criterion; bp, base pair; GC, nucleotides of either guanine or cytosine; kb, kilo (1000) bases; LLR, log likelihood ratio; Mb, mega (1,000,000) bases; MHC, human major histocompatibility complex.

0141-933/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0378-1119(02)00847-8

Gene 300 (2002) 129–139
www.elsevier.com/locate/gene

Besides using variances within and between windows to test homogeneity, in a branch of statistics called ‘change-