J. Parallel Distrib. Comput. 67 (2007) 1240–1255
www.elsevier.com/locate/jpdc
Assembling genomes on large-scale parallel computers
A. Kalyanaraman(a), S.J. Emrich(b,c), P.S. Schnable(c,d), S. Aluru(b,c,∗)
(a) School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164, USA
(b) Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA
(c) Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, IA 50011, USA
(d) Departments of Agronomy, and Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
Received 7 July 2006; received in revised form 1 May 2007; accepted 7 May 2007
Available online 9 June 2007
Abstract
Assembly of large genomes from tens of millions of short genomic fragments is computationally demanding, requiring hundreds of gigabytes
of memory and tens of thousands of CPU hours. The advent of high-throughput sequencing technologies, new gene-enrichment sequencing
strategies, and collective sequencing of environmental samples further exacerbate this situation. In this paper, we present the first massively
parallel genome assembly framework. The unique features of our approach include space-efficient and on-demand algorithms that consume only
linear space, and strategies to reduce the number of expensive pairwise sequence alignments while maintaining assembly quality. We applied
our assembly framework, developed as part of the ongoing efforts in maize genome sequencing, to genomic data containing a mixture of
gene-enriched and random shotgun sequences. We report the partitioning of more than 1.6 million fragments, totaling over 1.25 billion
nucleotides, into genomic islands in under 2 h on 1024 processors of an IBM BlueGene/L supercomputer. We also demonstrate the effectiveness of
the proposed approach for traditional whole genome shotgun sequencing and assembly of environmental sequences.
© 2007 Elsevier Inc. All rights reserved.
Keywords: Computational biology; Genome assembly; Genome sequencing; Parallel algorithms; Suffix trees
1. Introduction
Each cell in a living organism contains one or more long
DNA sequences called chromosomes, collectively known as
the genome. Contained within the genome are DNA sequences
called genes that code for proteins and RNA molecules, which
perform various cellular functions in an organism. Deciphering
an entire genome sequence and identifying regions within
it that are genes and regulatory elements is of fundamental
importance in molecular and functional genomics. Genome
sequencing also forms the basis for the rapidly expanding field of
comparative genomics, which attempts to study genome evolution
and unravel genome structure through cross-genome
comparisons.
∗ Corresponding author. Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA.
E-mail address: aluru@iastate.edu (S. Aluru).
0743-7315/$ - see front matter © 2007 Elsevier Inc. All rights reserved.
doi:10.1016/j.jpdc.2007.05.014
Genomes span multiple length scales, from a few tens of
thousands of nucleotides in viruses to millions of nucleotides
in microbes to billions of nucleotides in complex eukaryotic
organisms such as plants and animals. Because DNA is double
stranded, its length is measured in units called base pairs,
denoted bp. The biochemical procedure of determining the
nucleotide sequence of a DNA molecule is called sequencing.
Accurate sequencing is experimentally viable only up to hundreds
of base pairs (≈ 500–1000 bp). To extend the reach of
sequencing to genomic scales, long genomic stretches are sampled
at uniformly random locations by a procedure called shotgun
sequencing. This results in numerous short DNA fragments
that can be sequenced using conventional techniques. If this
procedure is directly applied to an entire genome, it is called
whole genome shotgun (WGS) sequencing. After generating
and sequencing such fragments, the target genome is computationally
assembled from them. The primary information used
during assembly is the pairwise overlaps that exist between
fragments derived from the same region of the genome. Pairwise
overlaps are detected by computing alignments between
the corresponding pairs of fragments using standard dynamic
programming approaches. Because such overlaps could also
result from fragments derived from different but repetitive
parts of the genome, fragments are typically sequenced in pairs