J. Parallel Distrib. Comput. 67 (2007) 1240–1255
www.elsevier.com/locate/jpdc

Assembling genomes on large-scale parallel computers

A. Kalyanaraman a, S.J. Emrich b,c, P.S. Schnable c,d, S. Aluru b,c,∗

a School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164, USA
b Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA
c Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, IA 50011, USA
d Departments of Agronomy, and Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA

Received 7 July 2006; received in revised form 1 May 2007; accepted 7 May 2007
Available online 9 June 2007

Abstract

Assembly of large genomes from tens of millions of short genomic fragments is computationally demanding, requiring hundreds of gigabytes of memory and tens of thousands of CPU hours. The advent of high-throughput sequencing technologies, new gene-enrichment sequencing strategies, and collective sequencing of environmental samples further exacerbates this situation. In this paper, we present the first massively parallel genome assembly framework. The unique features of our approach include space-efficient and on-demand algorithms that consume only linear space, and strategies to reduce the number of expensive pairwise sequence alignments while maintaining assembly quality. We developed the framework as part of the ongoing efforts in maize genome sequencing and applied it to genomic data containing a mixture of gene-enriched and random shotgun sequences. We report the partitioning of more than 1.6 million fragments of over 1.25 billion nucleotides total size into genomic islands in under 2 h on 1024 processors of an IBM BlueGene/L supercomputer. We also demonstrate the effectiveness of the proposed approach for traditional whole genome shotgun sequencing and assembly of environmental sequences.

© 2007 Elsevier Inc.
All rights reserved.

Keywords: Computational biology; Genome assembly; Genome sequencing; Parallel algorithms; Suffix trees

∗ Corresponding author. Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA.
E-mail address: aluru@iastate.edu (S. Aluru).
0743-7315/$ - see front matter. doi:10.1016/j.jpdc.2007.05.014

1. Introduction

Each cell in a living organism contains one or more long DNA sequences called chromosomes, collectively known as the genome. Contained within the genome are DNA sequences called genes that code for proteins and RNA molecules, which perform various cellular functions in an organism. Deciphering an entire genome sequence and identifying regions within it that are genes and regulatory elements is of fundamental importance in molecular and functional genomics. Genome sequencing also forms the basis for the rapidly expanding field of comparative genomics, which attempts to study genome evolution and unravel genome structure through cross-genome comparisons.

Genomes span multiple length scales, from a few tens of thousands of nucleotides in viruses, to millions of nucleotides in microbes, to billions of nucleotides in complex eukaryotic organisms such as plants and animals. Because DNA is double stranded, its length is measured in units called base pairs, denoted bp. The biochemical procedure of determining the nucleotide sequence of a DNA molecule is called sequencing. Accurate sequencing is experimentally viable only up to hundreds of base pairs (≈ 500–1000 bp). To extend the reach of sequencing to genomic scales, long genomic stretches are sampled at uniform random locations by a procedure called shotgun sequencing. This results in numerous short DNA fragments that can be sequenced using conventional techniques. If this procedure is directly applied to an entire genome, it is called whole genome shotgun (WGS) sequencing.
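The sampling step of shotgun sequencing can be illustrated with a small simulation. The following is a minimal sketch, not part of the framework described in this paper: the helper names `shotgun_fragments` and `covered_fraction`, the toy genome, and the fixed seed are all assumptions made for illustration. It also simplifies the biology by sampling from a single strand and ignoring sequencing errors.

```python
import random


def shotgun_fragments(genome, num_fragments, min_len=500, max_len=1000, seed=0):
    """Sample fragments at uniform random start positions (hypothetical helper).

    Fragment lengths are drawn uniformly from [min_len, max_len], mirroring
    the 500-1000 bp read lengths mentioned in the text.
    """
    rng = random.Random(seed)
    fragments = []
    for _ in range(num_fragments):
        # Clamp the length so a fragment never exceeds the genome itself.
        length = min(rng.randint(min_len, max_len), len(genome))
        # Uniform over all start positions that keep the fragment in bounds.
        start = rng.randint(0, len(genome) - length)
        fragments.append((start, genome[start:start + length]))
    return fragments


def covered_fraction(genome_length, fragments):
    """Fraction of genome positions covered by at least one fragment."""
    covered = [False] * genome_length
    for start, seq in fragments:
        for i in range(start, start + len(seq)):
            covered[i] = True
    return sum(covered) / genome_length


if __name__ == "__main__":
    # Toy random genome; real genomes are far larger and highly repetitive.
    rng = random.Random(1)
    genome = "".join(rng.choice("ACGT") for _ in range(20000))
    frags = shotgun_fragments(genome, num_fragments=200)
    print(len(frags), covered_fraction(len(genome), frags))
```

In this sketch the true start positions are returned only so that coverage can be measured; an assembler sees only the fragment sequences and must infer their relative placement.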
After generating and sequencing such fragments, the target genome is computationally assembled from them. The primary information used during assembly is the pairwise overlaps that exist between fragments derived from the same region of the genome. Pairwise overlaps are detected by computing alignments between the corresponding pairs of fragments using standard dynamic programming approaches. Because such overlaps could also result from fragments derived from different but repetitive parts of the genome, fragments are typically sequenced in pairs