Computational Biology and Chemistry 35 (2011) 251–258 Contents lists available at ScienceDirect Computational Biology and Chemistry jo ur n al homep age: www.elsevier.com/locate/compbiolchem An integer programming approach to DNA sequence assembly YoungJung Chang, Nikolaos V. Sahinidis ∗ Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, PA, USA a r t i c l e i n f o Article history: Received 19 February 2011 Received in revised form 11 May 2011 Accepted 6 June 2011 Keywords: DNA sequencing Short reads assembly Integer programming Global optimization Alternative solutions a b s t r a c t De novo sequence assembly is a ubiquitous combinatorial problem in all DNA sequencing technologies. In the presence of errors in the experimental data, the assembly problem is computationally challenging, and its solution may not lead to a unique reconstruct. The enumeration of all alternative solutions is important in drawing a reliable conclusion on the target sequence, and is often overlooked in the heuristic approaches that are currently available. In this paper, we develop an integer programming formulation and global optimization solution strategy to solve the sequence assembly problem with errors in the data. We also propose an efﬁcient technique to identify all alternative reconstructs. When applied to examples of sequencing-by-hybridization, our approach dramatically increases the length of DNA sequences that can be handled with global optimality certiﬁcate to over 10,000, which is more than 10 times longer than previously reported. For some problem instances, alternative solutions exhibited a wide range of different ability in reproducing the target DNA sequence. Therefore, it is important to utilize the methodology proposed in this paper in order to obtain all alternative solutions to reliably infer the true reconstruct. These alternative solutions can be used to reﬁne the obtained results and guide the design of further experiments to correctly reconstruct the target DNA sequence. © 2011 Elsevier Ltd. All rights reserved. 1. Introduction The introduction of massively parallel sequencing technolo- gies such as shotgun strategy enabled completely new ﬁelds of genome sequencing application including the 1000 genomes pro- ject (The 1000 Genomes Project Consortium, 2010). At the same time, these methods raised many issues regarding the computa- tional assembly of large-scale sequencing data, which still remain unresolved (Scheibye-Alsing et al., 2009). The computational pro- blem that is common to all the high-throughput techniques for de novo sequencing is to determine the target DNA sequence from its subsequences, which are collectively referred to as the spectrum of the target DNA sequence (Nagarajan and Pop, 2009). The information available for the target DNA strand depends on the experimental procedure for determining the spectrum. For example, the traditional sequencing-by-hybridization (SBH) returns a set of subsequences (Southern, 1988), which is common data shared among all the sequencing techniques. In addition, many sequencers allow the sequencing of paired-ends or mate-pair, which correspond to both ends of a DNA fragment that is longer than the read length (Schatz et al., 2010). Sometimes the position of some subsequences could be measured (Ben-Dor et al., 2001). Additional ∗ Corresponding author. E-mail addresses: yjchang@andrew.cmu.edu (Y. Chang), sahinidis@cmu.edu (N.V. Sahinidis). information from other experiments may also be available (Pihlak et al., 2008; Skiena and Snir, 2007). If the spectrum is free of errors, the reconstruction problem can be reduced to the Eulerian path problem, for which linear-time algorithms are known (Pevzner, 2000). However, some subsequen- ces may fail to appear in the spectrum, thus leading to negative errors. On the other hand, nonexistent subsequences may appear in the spectrum (e.g., highly stable mismatches in a hybridization experiment), thus leading to positive errors. In the case of errors, the sequence assembly problem is known to be NP-complete in the strong sense even in its simplest form (Bła ˙ zewicz and Kasprzak, 2003; Medvedev et al., 2007). Most of the graph-based assemblers employ simple heuristics to capture and treat the errors by exa- mining the graph structure (Miller et al., 2010). However, such an approach may eliminate correct reads from the spectrum. In order to handle both types of error in a more systematic way, optimization approaches can be utilized (Althaus et al., 2009; Lancia, 2008; Zhang et al., 2007). These approaches select the maxi- mum number of subsequences in the spectrum to reconstruct the target sequence, and can incorporate other methods to reconcile errors (Endo, 2004; Shamir and Tsur, 2002). The resulting opti- mization problem has been usually tackled using approximation algorithms or heuristics (Blum et al., 2008; Nikolakopoulos and Sarimveis, 2008; Tsur, 2005) because it is difﬁcult to solve to global optimality when the spectrum has more than 1000 subsequen- ces (Bła ˙ zewicz et al., 1999). Moreover, very few rigorous attempts have been made at ﬁnding multiple reconstructs that are equally 1476-9271/$ – see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2011.06.001