On a Divide and Conquer Approach for Haplotype Inference with Pure Parsimony

Konstantinos Kalpakis* and Parag Namjoshi*
*Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, Baltimore, MD 21250
{kalpakis,nam1}@umbc.edu

Abstract: The genotype of an individual consists of two DNA strands, each of which, called a haplotype, is inherited from one parent. The haplotypes of an individual may differ from those of his/her parents due to mutations or crossovers. Those haplotypes and their differences are the genetic basis for hereditary diseases such as Alzheimer's, diabetes, and heart disease. Determining the genotype of an individual in the "wet-lab" is relatively inexpensive. However, determining an individual's haplotypes in the "wet-lab" is rather expensive and time-consuming. Consequently, developing computational methods for inferring haplotypes from genotypes is an important problem. Since this haplotype inference problem can have multiple solutions, solutions with as few haplotypes as possible are often sought, leading to the Haplotype Inference problem with Pure Parsimony (HIPP).

We present a divide-and-conquer approach for HIPP problems with a large number of individuals and long genotypes. We explore various design parameters of the divide-and-conquer approach, and experimentally compare its performance, with respect to the number of haplotypes (k) found and the running times, with that of Clark's rule on both synthetic and real datasets. We find that our approach uses up to 43% fewer haplotypes than Clark's rule for synthetic datasets with 100 genotypes of length 100, and that our approach and Clark's rule use a similar number of haplotypes for the real datasets. Our divide-and-conquer approach is an effective and efficient method for large HIPP problems.

Index Terms: Bioinformatics, haplotype inference from genotypes, pure parsimony.

I. INTRODUCTION

Humans inherit two nearly identical copies of each autosomal DNA sequence, called haplotypes, one from each parent. These two sequences differ from each other, and from the DNA sequences of other individuals, due to mutations and crossovers. Most of these differences are in the form of Single Nucleotide Polymorphisms (SNPs), which are the sites (loci) where two or more distinct nucleotides occur in the population. The SNPs are conjectured to form the genetic basis for hereditary diseases such as Alzheimer's, diabetes, and heart disease. In the HapMap project, and many other haplotype studies, wet-lab experiments are performed to identify the genotypes of individuals. Though genotypes can be determined in the wet-lab rather inexpensively, it is expensive and time-consuming to determine the haplotypes in the wet-lab. Consequently, developing computational methods for inferring haplotypes from given genotypes is an important problem. Moreover, since there are multiple solutions to this haplotype inference problem, solutions with the fewest haplotypes are often sought, leading to the Haplotype Inference problem with Pure Parsimony (HIPP). Empirical evidence suggests that a solution with a small number of haplotypes is often sufficiently accurate for biological purposes [6], [18].

The HIPP problem asks for a minimum cardinality set of haplotypes resolving the given genotypes. The HIPP problem is APX-hard, and hence NP-hard as well. Halldórson et al. [8] and Gusfield and Orzack [7] provide comprehensive reviews of haplotype inference problems. Brown and Harrower [2] provide an excellent survey of the HIPP problem.

Many useful exact and approximate approaches have been suggested for the HIPP problem. Clark [3] describes a simple greedy inference rule to compute a set of haplotypes that resolve a given set of genotypes.
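To make the notions of "resolving" a genotype and of Clark's greedy rule concrete, the following is a minimal Python sketch, not the implementation used in this paper or in [3]. It assumes a common encoding that the text above does not fix: a genotype is a tuple over {0, 1, 2}, where 2 marks a heterozygous site, and a haplotype is a tuple over {0, 1}. The names `complement` and `clark_rule` are hypothetical, and the sketch simply takes the first compatible known haplotype in sorted order; other orders can yield different haplotype sets.

```python
from typing import List, Optional, Set, Tuple

Genotype = Tuple[int, ...]   # assumed encoding: 0/1 homozygous, 2 heterozygous
Haplotype = Tuple[int, ...]  # entries in {0, 1}


def complement(g: Genotype, h: Haplotype) -> Optional[Haplotype]:
    """Return h2 such that the pair (h, h2) resolves g, or None if h is incompatible with g."""
    h2 = []
    for gi, hi in zip(g, h):
        if gi == 2:
            h2.append(1 - hi)   # heterozygous site: the two haplotypes must differ
        elif gi == hi:
            h2.append(hi)       # homozygous site: both haplotypes equal the genotype
        else:
            return None         # h disagrees with g at a homozygous site
    return tuple(h2)


def clark_rule(genotypes: List[Genotype]) -> Tuple[Set[Haplotype], List[Genotype]]:
    """Greedy inference sketch: returns (haplotypes found, genotypes left unresolved)."""
    known: Set[Haplotype] = set()
    unresolved: List[Genotype] = []

    # Seed step: a genotype with at most one heterozygous site has a unique resolution.
    for g in genotypes:
        if g.count(2) <= 1:
            h1 = tuple(0 if x == 2 else x for x in g)
            known.add(h1)
            known.add(complement(g, h1))
        else:
            unresolved.append(g)

    # Inference step: resolve a genotype with a known haplotype and add its partner.
    progress = True
    while progress and unresolved:
        progress = False
        still_unresolved = []
        for g in unresolved:
            partner = None
            for h in sorted(known):  # fixed order, chosen only to keep the sketch deterministic
                partner = complement(g, h)
                if partner is not None:
                    break
            if partner is None:
                still_unresolved.append(g)
            else:
                known.add(partner)
                progress = True
        unresolved = still_unresolved
    return known, unresolved


if __name__ == "__main__":
    haps, orphans = clark_rule([(0, 0, 1), (0, 2, 2), (1, 2, 2)])
    print(sorted(haps))  # [(0, 0, 1), (0, 1, 0)]
    print(orphans)       # [(1, 2, 2)]
```

In this toy input, the homozygous genotype seeds the haplotype (0, 0, 1), which resolves (0, 2, 2) together with the inferred partner (0, 1, 0). The genotype (1, 2, 2) is compatible with neither known haplotype, so the greedy rule leaves it unresolved.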
Gusfield [6] describes an integer programming scheme, called TIP, for the HIPP problem: enumerate all pairs of haplotypes that could resolve the genotypes, and select, using an integer program, a minimum size set of haplotypes. The TIP scheme is practical if the genotypes contain few heterozygous loci. We use the linear relaxation of TIP's integer program as a subroutine in our divide-and-conquer scheme. Wang and Xu [18] give a branch-and-bound implementation of TIP. Brown and Harrower [1] give another integer programming formulation (HB-IP) for the HIPP problem. In [2], they extend HB-IP with limited enumeration and some additional cuts to solve some moderately large problems. Kalpakis and Namjoshi [10] give a semidefinite programming formulation for the HIPP problem and its various extensions. They also show that the rank of a (suitably transformed) genotype matrix is a lower bound for the HIPP problem. SAT and constraint programming based approaches for HIPP are discussed by Lynce [12], Lynce et al. [13], and Neigenfind et al. [14].

Maximum likelihood approaches to haplotype inference differ fundamentally from pure-parsimony approaches. Qin et al. [15] describe a divide-and-conquer algorithm extending a maximum-likelihood approach for haplotype inference in [16]. Eskin, Sharan, and Halperin [5] use a sliding window approach that finds locally consistent haplotypes instead of minimizing the number of haplotypes used.

We provide a simple scalable divide-and-conquer approach for the HIPP problem, and we experimentally compare our approach with existing approaches capable of solving large