A novel efficient dynamic programming algorithm for haplotype block partitioning

J. Zahiri b, G. Mahdevar b, A. Nowzari-dalini a,*, H. Ahrabian a, M. Sadeghi c,d

a Center of Excellence in Biomathematics, School of Mathematics, Statistics, and Computer Science, University of Tehran, Tehran, Iran
b Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
c National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
d School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

Article info
Article history: Received 30 July 2009; received in revised form 10 August 2010; accepted 16 August 2010; available online 20 August 2010
Keywords: Genomics; Haplotype diversity; SNP; Tag SNP

Abstract
In this paper, a new efficient algorithm is presented for haplotype block partitioning based on haplotype diversity. The main goal of the algorithm, posed as an optimization problem, is to find the largest meaningful blocks that satisfy the diversity condition. The algorithm runs in time polynomial in the number of haplotypes and SNPs. We apply our algorithm to three biological data sets for chromosome 21, drawn from three different populations in the HapMap data; the results show the efficiency and better performance of our algorithm in comparison with three other well-known methods.
© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Since the release of the human genome data, one of the major interests of researchers has been to understand the genomic differences within the human population. Most of the genomic variation in the population is due to single nucleotide polymorphisms (SNPs) (Gray et al., 2000). A SNP refers to the existence of two specific nucleotides at a single locus in a population. A haplotype can be defined as a set of SNPs on a single chromosome that are associated and inherited as a unit.
Recently, haplotype analysis has been successfully applied to the identification of DNA variations relevant to several common and complex diseases (Bonnen et al., 2002; Indap et al., 2005; Mas et al., 2005; Reif et al., 2006; Gray et al., 2000; Nowotny et al., 2001). Many studies suggest that the human genome may be arranged into a block structure in which SNPs are related and only a small number of SNPs, called tag SNPs, are sufficient to capture most of the haplotype structure (Daly et al., 2001; Gabriel et al., 2002; Patil et al., 2001; Dawson et al., 2002; Mahdevar et al., 2010; Zhang et al., 2002; Wall and Pritchard, 2003).

Several methods have been suggested for defining block structure, some of which are more commonly used than others. The three main criteria for haplotype block partitioning are haplotype diversity (Patil et al., 2001; Johnson et al., 2001), linkage disequilibrium (LD) (Gabriel et al., 2002; Greenspan and Geiger, 2004) and the four gamete test (Wang et al., 2002; Hudson and Kaplan, 1985). In diversity-based methods, a block is defined as a region in which a certain percentage of haplotypes are common haplotypes, that is, haplotypes represented at more than a certain percentage in the population. In LD-based methods, a block is defined as a region with high pair-wise LD within the block and low pair-wise LD between blocks. In methods based on the four gamete test, a block is defined as a recombination-free region of consecutive SNPs.

Based on these three criteria, many block partitioning algorithms have been designed. Patil et al. (2001) used a diversity-based greedy algorithm to partition chromosome 21 into haplotype blocks in a sample of 20 re-sequenced chromosomes. They considered all blocks of one or more consecutive SNPs and accepted a region as a haplotype block when at least 80% of the observed haplotypes within it were represented at least twice in their sample of chromosomes (these haplotypes are called common haplotypes).
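The diversity criterion described above can be illustrated with a minimal sketch. It assumes that haplotypes are given as equal-length strings over the alleles 0/1 with no missing data, and it uses the 80% threshold and the "at least twice" rule from the description of Patil et al. (2001). The function names, the half-open SNP range and the toy data are hypothetical and serve only as illustration; they are not taken from any of the cited implementations.

```python
from collections import Counter


def common_haplotype_fraction(haplotypes, start, end, min_count=2):
    """Fraction of haplotypes whose pattern over SNPs [start, end)
    occurs at least `min_count` times in the sample.

    Assumption: every haplotype is a string of equal length with no
    missing alleles.
    """
    patterns = [h[start:end] for h in haplotypes]
    counts = Counter(patterns)
    # Count every chromosome that carries a common pattern.
    common = sum(c for c in counts.values() if c >= min_count)
    return common / len(patterns)


def is_block(haplotypes, start, end, threshold=0.8, min_count=2):
    """True if SNPs [start, end) satisfy the diversity condition."""
    fraction = common_haplotype_fraction(haplotypes, start, end, min_count)
    return fraction >= threshold


# Hypothetical toy sample: 5 haplotypes over 4 biallelic SNPs.
haplotypes = ["0011", "0011", "0010", "1100", "1101"]
```

On this toy sample, the first two SNPs form a block because every haplotype carries a pattern seen at least twice, whereas the full four-SNP region does not: only two of the five haplotypes share a pattern there. A partitioning algorithm would evaluate such a test over many candidate regions and then select among the regions that pass.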
Zhang et al. (2004) subsequently provided a dynamic programming implementation of this approach in the software HapBlock (Zhang et al., 2005). They applied their method to the chromosome 21 data used by Patil et al. and obtained fewer blocks and tag SNPs than the Patil method, as well as blocks with greater average length.

* Corresponding author. Tel.: +98 21 61112916. E-mail address: nowzari@ut.ac.ir (A. Nowzari-dalini).
Journal of Theoretical Biology 267 (2010) 164–170. doi:10.1016/j.jtbi.2010.08.019

Gabriel et al. (2002) used an LD-based algorithm to define haplotype blocks in a worldwide sample of chromosomes from Africa, Asia, and Europe. They computed confidence bounds on the value of D′, a standard measure of LD, and defined pairs of SNPs to be in strong LD (little evidence of recombination) if the one-sided 95% D′