Computational Biology and Chemistry 35 (2011) 251–258
Contents lists available at ScienceDirect
Computational Biology and Chemistry
jo ur n al homep age: www.elsevier.com/locate/compbiolchem
An integer programming approach to DNA sequence assembly
YoungJung Chang, Nikolaos V. Sahinidis
∗
Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
a r t i c l e i n f o
Article history:
Received 19 February 2011
Received in revised form 11 May 2011
Accepted 6 June 2011
Keywords:
DNA sequencing
Short reads assembly
Integer programming
Global optimization
Alternative solutions
a b s t r a c t
De novo sequence assembly is a ubiquitous combinatorial problem in all DNA sequencing technologies.
In the presence of errors in the experimental data, the assembly problem is computationally challenging,
and its solution may not lead to a unique reconstruct. The enumeration of all alternative solutions is
important in drawing a reliable conclusion on the target sequence, and is often overlooked in the heuristic
approaches that are currently available. In this paper, we develop an integer programming formulation
and global optimization solution strategy to solve the sequence assembly problem with errors in the data.
We also propose an efficient technique to identify all alternative reconstructs. When applied to examples
of sequencing-by-hybridization, our approach dramatically increases the length of DNA sequences that
can be handled with global optimality certificate to over 10,000, which is more than 10 times longer than
previously reported. For some problem instances, alternative solutions exhibited a wide range of different
ability in reproducing the target DNA sequence. Therefore, it is important to utilize the methodology
proposed in this paper in order to obtain all alternative solutions to reliably infer the true reconstruct.
These alternative solutions can be used to refine the obtained results and guide the design of further
experiments to correctly reconstruct the target DNA sequence.
© 2011 Elsevier Ltd. All rights reserved.
1. Introduction
The introduction of massively parallel sequencing technolo-
gies such as shotgun strategy enabled completely new fields of
genome sequencing application including the 1000 genomes pro-
ject (The 1000 Genomes Project Consortium, 2010). At the same
time, these methods raised many issues regarding the computa-
tional assembly of large-scale sequencing data, which still remain
unresolved (Scheibye-Alsing et al., 2009). The computational pro-
blem that is common to all the high-throughput techniques for de
novo sequencing is to determine the target DNA sequence from its
subsequences, which are collectively referred to as the spectrum of
the target DNA sequence (Nagarajan and Pop, 2009).
The information available for the target DNA strand depends
on the experimental procedure for determining the spectrum.
For example, the traditional sequencing-by-hybridization (SBH)
returns a set of subsequences (Southern, 1988), which is common
data shared among all the sequencing techniques. In addition, many
sequencers allow the sequencing of paired-ends or mate-pair, which
correspond to both ends of a DNA fragment that is longer than the
read length (Schatz et al., 2010). Sometimes the position of some
subsequences could be measured (Ben-Dor et al., 2001). Additional
∗
Corresponding author.
E-mail addresses: yjchang@andrew.cmu.edu (Y. Chang), sahinidis@cmu.edu
(N.V. Sahinidis).
information from other experiments may also be available (Pihlak
et al., 2008; Skiena and Snir, 2007).
If the spectrum is free of errors, the reconstruction problem can
be reduced to the Eulerian path problem, for which linear-time
algorithms are known (Pevzner, 2000). However, some subsequen-
ces may fail to appear in the spectrum, thus leading to negative
errors. On the other hand, nonexistent subsequences may appear
in the spectrum (e.g., highly stable mismatches in a hybridization
experiment), thus leading to positive errors. In the case of errors,
the sequence assembly problem is known to be NP-complete in the
strong sense even in its simplest form (Bła ˙ zewicz and Kasprzak,
2003; Medvedev et al., 2007). Most of the graph-based assemblers
employ simple heuristics to capture and treat the errors by exa-
mining the graph structure (Miller et al., 2010). However, such an
approach may eliminate correct reads from the spectrum.
In order to handle both types of error in a more systematic
way, optimization approaches can be utilized (Althaus et al., 2009;
Lancia, 2008; Zhang et al., 2007). These approaches select the maxi-
mum number of subsequences in the spectrum to reconstruct the
target sequence, and can incorporate other methods to reconcile
errors (Endo, 2004; Shamir and Tsur, 2002). The resulting opti-
mization problem has been usually tackled using approximation
algorithms or heuristics (Blum et al., 2008; Nikolakopoulos and
Sarimveis, 2008; Tsur, 2005) because it is difficult to solve to global
optimality when the spectrum has more than 1000 subsequen-
ces (Bła ˙ zewicz et al., 1999). Moreover, very few rigorous attempts
have been made at finding multiple reconstructs that are equally
1476-9271/$ – see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compbiolchem.2011.06.001