Inter-dinucleotide distances in the human genome: an analysis of the whole-genome and protein-coding distributions

Carlos A. C. Bastos 1,2*, Vera Afreixo 1,3, Armando J. Pinho 1,2, Sara P. Garcia 1, João M. O. S. Rodrigues 1,2, Paulo J. S. G. Ferreira 1,2

1 Signal Processing Lab, IEETA, University of Aveiro, 3810-193 Aveiro, Portugal
2 Department of Electronics, Telecommunications and Informatics, University of Aveiro, 3810-193 Aveiro, Portugal
3 Department of Mathematics, University of Aveiro, 3810-193 Aveiro, Portugal

Summary

We study the inter-dinucleotide distance distributions in the human genome, both in the whole-genome and protein-coding regions. The inter-dinucleotide distance is defined as the distance to the next occurrence of the same dinucleotide. We consider the 16 sequences of inter-dinucleotide distances and two reading frames. Our results show a period-3 oscillation in the protein-coding inter-dinucleotide distance distributions that is absent from the whole-genome distributions. We also compare the distance distribution of each dinucleotide to a reference distribution, that of a random sequence generated with the same dinucleotide abundances, revealing the CG dinucleotide as the one with the highest cumulative relative error for the first 60 distances. Moreover, the distance distribution of each dinucleotide is compared to the distance distribution of all other dinucleotides using the Kullback-Leibler divergence. We find that the distance distribution of a dinucleotide and that of its reversed complement are very similar, hence, the divergence between them is very small. This is an interesting finding that may give evidence of a stronger parity rule than Chargaff’s second parity rule.
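The definition of the inter-dinucleotide distance and the Kullback-Leibler comparison mentioned in the summary can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the `frame` and `step` parameters (one possible reading of "reading frames"), the truncation at `max_distance`, the base-2 logarithm and the `eps` guard against zero probabilities are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math


def inter_dinucleotide_distances(seq, frame=0, step=1):
    """Return, for each dinucleotide, the distances to its next occurrence.

    Assumption: dinucleotides are read starting at offset `frame`, advancing
    `step` nucleotides at a time (step=1 gives overlapping dinucleotides,
    step=2 non-overlapping ones). Distances are counted in reading steps.
    """
    last_seen = {}
    distances = defaultdict(list)
    for index, pos in enumerate(range(frame, len(seq) - 1, step)):
        dinuc = seq[pos:pos + 2]
        if dinuc in last_seen:
            distances[dinuc].append(index - last_seen[dinuc])
        last_seen[dinuc] = index
    return dict(distances)


def distance_distribution(distances, max_distance):
    """Relative frequencies of distances 1..max_distance (longer ones dropped)."""
    counts = Counter(d for d in distances if d <= max_distance)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no distances within max_distance")
    return [counts.get(k, 0) / total for k in range(1, max_distance + 1)]


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) in bits.

    Assumption: zero entries of q are replaced by `eps` to keep the sum finite.
    """
    return sum(pi * math.log2(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    toy = "ACGACGTACG"  # toy sequence, not real genomic data
    d = inter_dinucleotide_distances(toy)
    print(d)  # {'AC': [3, 4], 'CG': [3, 4]}
    p_ac = distance_distribution(d["AC"], max_distance=4)
    p_cg = distance_distribution(d["CG"], max_distance=4)
    print(kl_divergence(p_ac, p_cg))  # 0.0, the two toy distributions coincide
```

On real data, the same three steps would be applied to each of the 16 dinucleotides, and the divergence computed for every pair of distributions. How the paper handles reading frames and long distances may differ from the choices made here.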
* To whom correspondence should be addressed. Email: cbastos@ua.pt

Copyright 2011 The Author(s). Published by Journal of Integrative Bioinformatics. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).

Journal of Integrative Bioinformatics, 8(3):172, 2011
http://journal.imbio.de
doi:10.2390/biecoll-jib-2011-172

1 Introduction

Finding and understanding correlation structures in genomic sequences has been the goal of many studies, and several methodologies have been employed, including heterogeneities in compositional biases, segmentation techniques, entropy measures, correlation functions, Fourier analysis, wavelet analysis or the analysis of self-similarities (e.g. [20, 12, 15, 4, 3, 2, 9, 7]). We aimed at contributing to this goal by studying the distribution of the distances between similar n-mers. We started with nucleotides, by exploring the inter-nucleotide distance (i.e. the distance to the next occurrence of the same nucleotide) in the genomes of organisms from the three domains of life, and using the distributions for inferring phylogenies [1]. Here, we address the distribution of inter-dinucleotide distances (i.e. the distance to the next occurrence of the same dinucleotide) and focus our analysis on the human genome. Dinucleotides have a prominent role in genome biology, hence, studying their content and distribution is key