Abstract—The challenge of comparing two or more genomes
that have undergone recombination and substantial amounts of
segmental loss and gain has recently been addressed for small
numbers of genomes. However, datasets of hundreds of genomes
are now common and their sizes will only increase in the future.
Multiple sequence alignment of hundreds of genomes remains
an intractable problem due to quadratic increases in compute
time and memory footprint. To date, most alignment algorithms
are designed for commodity clusters without parallelism.
Hence, we propose the design of a multiple sequence alignment
algorithm on massively parallel, distributed memory
supercomputers to enable research into comparative genomics
on large data sets. Following the methodology of the sequential
progressiveMauve algorithm, we design data structures
including sequences and sorted k-mer lists on the IBM Blue
Gene/P supercomputer (BG/P). Preliminary results show that
we can reduce the memory footprint so that we can potentially
align over 250 bacterial genomes on a single BG/P compute
node. We verify our results on a dataset of E. coli, Shigella and
S. pneumoniae genomes. Our implementation returns results
matching those of the original algorithm but in half the time and
with one quarter of the memory footprint for scaffold building. In this
study, we have laid the basis for multiple sequence alignment of
large-scale datasets on a massively parallel, distributed memory
supercomputer, thus enabling comparison of hundreds instead
of a few genome sequences within reasonable time.
Manuscript received March 24, 2011. This research was supported by
Victorian Life Sciences Computation Initiative (VLSCI) grant numbers
VR0126 and VR0082 on its Peak Computing Facility at the University of
Melbourne, an initiative of the Victorian Government.
P. C. Church is with Deakin University, Science and Technology
(corresponding author; phone: +61 3 52271399, e-mail:
pcc@deakin.edu.au).
A. Goscinski is with Deakin University, Science and Technology
(e-mail: andrzej.goscinski@deakin.edu.au).
K. Holt is with the Dept. of Microbiology and Immunology at the
University of Melbourne, Carlton, VIC, Australia (e-mail:
kholt@unimelb.edu.au).
M. Inouye is with the Walter and Eliza Hall Institute of Medical
Research, Parkville, VIC, Australia and with the Dept. of Medical Biology at
the University of Melbourne, Parkville, VIC, Australia (e-mail:
inouye@wehi.edu.au).
A. Ghoting and K. Makarychev are with the IBM T. J. Watson Research
Center, Yorktown Heights, NY, USA (e-mail: {aghoting;
konstantin}@us.ibm.com).
M. Reumann is with the IBM Research Collaboratory for Life
Sciences-Melbourne, Carlton, VIC, Australia and the Dept. of Computer
Science and Software Engineering, University of Melbourne (e-mail:
mreumann@ieee.org).
I. INTRODUCTION
COMPARATIVE genomics relies heavily on the alignment
of multiple genomes. With recent advances in
microbial diagnostics, typing and surveillance, comparative
genomics is playing an increasingly important role in
epidemiology, pathogen evolution and the fight against drug
resistance. One way to characterize fine-scale genomic
variation and confidently infer drug resistance mutations is
through aligning and comparing many genomes. Therefore,
there is a pressing need for computationally efficient
algorithms and tools which are able to scale with current and
future genomic datasets.
There are many methods of sequence alignment; common
algorithms include ClustalW2 [2], MUSCLE [3], T-Coffee [4]
and progressiveMauve [5]. The latter is a sequence aligner
used predominantly for bacteria. It can be used to find genome
rearrangements and aligns to a scaffold of conserved regions.
Alignment makes use of a tree method similar to ClustalW2
and MUSCLE. The progressiveMauve algorithm is ideal for
analyzing population diversity as genome rearrangements are
taken into account during alignment. However, the current
implementation is both computationally and memory intensive.
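The sorted k-mer lists named in the abstract can be illustrated with a minimal sketch. It assumes a plain in-memory list of (k-mer, position) pairs and a merge-style scan to find k-mers shared by two sequences, which could then serve as candidate anchors for conserved regions. The function names `sorted_kmer_list` and `shared_kmers` are hypothetical, and the sketch is not the progressiveMauve or BG/P implementation.

```python
from typing import List, Tuple

KmerList = List[Tuple[str, int]]


def sorted_kmer_list(seq: str, k: int) -> KmerList:
    """Return all (k-mer, start position) pairs of seq, sorted by k-mer."""
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    kmers.sort()  # lexicographic by k-mer, ties broken by position
    return kmers


def shared_kmers(a: KmerList, b: KmerList) -> List[Tuple[str, int, int]]:
    """Scan two sorted k-mer lists in one pass and report every exact match.

    Each result is (k-mer, position in first sequence, position in second).
    """
    matches = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            i += 1
        elif a[i][0] > b[j][0]:
            j += 1
        else:
            kmer = a[i][0]
            # Find the run of identical k-mers in each list.
            i_end = i
            while i_end < len(a) and a[i_end][0] == kmer:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end][0] == kmer:
                j_end += 1
            # Report every pairing of positions within the two runs.
            for _, pos_a in a[i:i_end]:
                for _, pos_b in b[j:j_end]:
                    matches.append((kmer, pos_a, pos_b))
            i, j = i_end, j_end
    return matches


if __name__ == "__main__":
    list_a = sorted_kmer_list("ACGTACGGA", 4)
    list_b = sorted_kmer_list("TTACGGAC", 4)
    for kmer, pos_a, pos_b in shared_kmers(list_a, list_b):
        print(kmer, pos_a, pos_b)
```

Because both lists are sorted, the scan visits each entry once, so matching costs time linear in the list lengths plus the number of reported matches. Storing whole k-mer strings as done here is memory hungry; a compact encoding would be needed at genome scale.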
Alignment methods commonly trade speed against accuracy.
They often require an amount of random access memory (RAM)
that grows exponentially with the input data size. When
aligning large genomic populations, the available RAM can
become the limiting factor. While virtual memory can be used, it
often results in a slowdown of computation. Thus, considering
that genomic data will only increase in size and volume in the
future, the problem of multiple sequence alignment will need
high performance computing (HPC) systems.
To our knowledge, there is no scalable alignment method
available that accurately measures genetic diversity. Further,
most alignment algorithms are designed for shared memory
computers. These will require terabytes of RAM to carry out
multiple sequence alignment of hundreds of genomes.
Designing an algorithm for massively parallel, distributed
memory machines such as the IBM Blue Gene/P
supercomputer (BG/P) will create opportunities for fast, high
throughput multiple sequence alignment in the future.
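To make the motivation for distributed memory concrete, the following sketch counts the genome pairs in an all-against-all comparison and splits them round-robin over a number of ranks. It assumes that pairwise comparison is the source of the quadratic growth mentioned in the abstract, which the text does not state explicitly. The names `pair_count` and `pairs_for_rank` are hypothetical, and the round-robin scheme is only an illustration, not the partitioning used by the authors. On a real distributed memory machine the rank and the number of ranks would come from the message passing library rather than from function arguments.

```python
from itertools import combinations
from typing import List, Tuple


def pair_count(n_genomes: int) -> int:
    """Number of unordered genome pairs, n * (n - 1) / 2."""
    return n_genomes * (n_genomes - 1) // 2


def pairs_for_rank(n_genomes: int, rank: int, n_ranks: int) -> List[Tuple[int, int]]:
    """Return the genome index pairs assigned to one rank (round-robin)."""
    if not 0 <= rank < n_ranks:
        raise ValueError("rank must satisfy 0 <= rank < n_ranks")
    all_pairs = combinations(range(n_genomes), 2)
    # Pair number idx goes to rank idx mod n_ranks.
    return [pair for idx, pair in enumerate(all_pairs) if idx % n_ranks == rank]


if __name__ == "__main__":
    n = 250
    print("pairs in total:", pair_count(n))
    print("pairs on rank 0 of 1024:", len(pairs_for_rank(n, 0, 1024)))
```

Under these assumptions each rank holds only the data for its own share of the pairs, so the per-node memory requirement shrinks as ranks are added even though the total work still grows quadratically with the number of genomes.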
In this study we present our design and concept as well as
preliminary results of a multiple sequence alignment
algorithm for massively parallel, distributed memory HPC
systems.
II. MATERIALS AND METHODS
We choose to parallelize the progressiveMauve algorithm due
to its advantages regarding analyzing population diversity. In
addition, comparing our results with the original output serves
to validate our method. First, we analyze progressiveMauve
with respect to regions that have to be carried out sequentially
Philip C. Church, Student Member, IEEE, Andrzej Goscinski, Kathryn Holt, Michael Inouye, Amol
Ghoting, Konstantin Makarychev, and Matthias Reumann, Member, IEEE
Design of Multiple Sequence Alignment Algorithms on Parallel,
Distributed Memory Supercomputers
978-1-4244-4122-8/11/$26.00 ©2011 IEEE 924
33rd Annual International Conference of the IEEE EMBS
Boston, Massachusetts USA, August 30 - September 3, 2011