Maximum likelihood model based on minor allele frequencies and weighted Max-SAT formulation for haplotype assembly

Sayyed R. Mousavi a,b,*, Ilnaz Khodadadi a, Hossein Falsafain a, Reza Nadimi a, Nasser Ghadiri a

a Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran
b School of Computer Science, Institute for Research in Fundamental Sciences, Tehran, Iran

HIGHLIGHTS

New probabilistic models for haplotype assembly.
Based on the maximum likelihood paradigm using minor allele frequencies.
A theoretical support for the minimum error correction model.
A weighted Max-SAT formulation for a simplified model.
Accuracy improvement confirmed by experimental results.

ARTICLE INFO

Article history:
Received 3 December 2012
Received in revised form 24 January 2014
Accepted 24 January 2014
Available online 31 January 2014

Keywords: Single individual haplotyping; Single nucleotide polymorphism; Haplotype reconstruction; Minimum error correction; Algorithms

ABSTRACT

Human haplotypes include essential information about SNPs, which in turn provide valuable information for studies such as finding relationships between some diseases and their potential genetic causes, e.g., Genome Wide Association Studies. Because directly determining haplotypes is expensive, and because of recent progress in high-throughput sequencing, there has been increasing motivation for haplotype assembly, which is the problem of finding a pair of haplotypes from a set of aligned fragments. Although the problem has been studied extensively and a number of algorithms have already been proposed for it, more accurate methods are still beneficial because of the high importance of haplotype information. In this paper, first, we develop a probabilistic model that incorporates the Minor Allele Frequency (MAF) of SNP sites, which is missing from the existing maximum likelihood models.
Then, we show that the probabilistic model reduces to the Minimum Error Correction (MEC) model when the MAF information is omitted and some approximations are made. This result provides novel theoretical support for the MEC model, despite some criticisms of it in the recent literature. Next, under the same approximations, we simplify the model to an extension of the MEC in which the MAF information is used. Finally, we extend the haplotype assembly algorithm HapSAT by developing a weighted Max-SAT formulation for the simplified model, which is evaluated empirically with positive results.

© 2014 Elsevier Ltd. All rights reserved.

Contents lists available at ScienceDirect
Journal of Theoretical Biology
journal homepage: www.elsevier.com/locate/yjtbi
http://dx.doi.org/10.1016/j.jtbi.2014.01.036
0022-5193

1. Introduction

In humans, which are diploid organisms, each chromosome consists of two haplotypes, one inherited from the mother and the other from the father. Haplotypes contain Single Nucleotide Polymorphisms (SNPs), which provide valuable information for many genomics research purposes, e.g., for Genome Wide Association Studies (GWAS) (Hirschhorn and Daly, 2005). Because determining haplotypes directly is expensive, the most common approach is to obtain them computationally from a given set of aligned fragments, a problem known as Single Individual Haplotyping (SIH) or haplotype assembly. The use of computational methods has increased in recent years because of impressive progress in computational biology, especially in Whole Genome Sequencing (WGS) and Next Generation Sequencing (NGS) technologies (Levy et al., 2007; The International HapMap Consortium, 2005). With current sequencing technology, it is not known from which haplotype a read fragment originates. In the haplotype
* Corresponding author at: Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran. Tel.: +98 311 391 2450; fax: +98 311 391 2451.
E-mail address: srm@cc.iut.ac.ir (S.R. Mousavi).

Journal of Theoretical Biology 350 (2014) 49–56