KURIR. Kingston University Research & Innovation Reports. Nebel & Pezzulli, Distribution of Human Genes Observes Zipf’s Law. Vol. 8, 2012. ISSN Number: 1749-5652

Distribution of Human Genes Observes Zipf’s Law

Jean-Christophe Nebel and Sergio Pezzulli
Faculty of Science, Engineering and Computing
Kingston University, London KT1 2EE

Keywords
Human genome, gene distributions, chromosomes, mathematical models, Benford’s law, Zipf’s law, gene detection, gene annotation, bioinformatics, data mining.

Abstract
Recent research suggests that the distribution of genes on chromosomes can be informative about their nature. Consequently, gene distribution analysis may contribute not only to better gene detection, but also to better gene annotation, which is particularly important to high-throughput genome projects. This paper investigates two possible mathematical models, namely Benford’s law and Zipf’s law, to describe the distribution of gene positions on human chromosomes. After a review of phenomena following either of these laws, it is shown that observance of Benford’s law has to be rejected. However, most human chromosomes display gene distributions which can be accurately modelled by Zipf’s law. This discovery may impact the analysis of genome sequence data, since the proposed gene distribution model could be integrated into gene detection software.

Introduction
Recent research suggests not only that gene distribution on chromosomes is non-random (Rafiee et al., 2008), but also that the location of genes can be informative about their nature. A study of lineage-specific genes in Plasmodium revealed that species-specific genes are located near chromosome ends (Kuo & Kissinger, 2008). Moreover, experimental work on C. elegans indicates that gene positions on chromosomes affect physical trait variability (Rockman et al., 2010). These findings suggest that the analysis of gene distribution on chromosomes may contribute not only to better gene detection, but also to better gene annotation.
This is particularly relevant to high-throughput genome projects, where better automatic annotation methods are required (Yang et al., 2010). This paper intends to contribute to this field by providing a mathematical model of the distribution of gene positions on human chromosomes.

Independently, Newcomb (1881) and Benford (1938) observed that the usage of logarithm books followed a very specific distribution, now called Benford’s law, in which numbers starting with a digit d are more frequent than those starting with the digit d+1. More specifically, this is expressed by the following equation, where P(d) is the probability of observing a number starting with the digit d: