Abstract—The challenge of comparing two or more genomes that have undergone recombination and substantial amounts of segmental loss and gain has recently been addressed for small numbers of genomes. However, datasets of hundreds of genomes are now common and their sizes will only increase in the future. Multiple sequence alignment of hundreds of genomes remains an intractable problem due to quadratic increases in compute time and memory footprint. To date, most alignment algorithms are designed for commodity clusters without parallelism. Hence, we propose the design of a multiple sequence alignment algorithm on massively parallel, distributed memory supercomputers to enable research into comparative genomics on large datasets. Following the methodology of the sequential progressiveMauve algorithm, we design data structures including sequences and sorted k-mer lists on the IBM Blue Gene/P supercomputer (BG/P). Preliminary results show that we can reduce the memory footprint so that we can potentially align over 250 bacterial genomes on a single BG/P compute node. We verify our results on a dataset of E. coli, Shigella and S. pneumoniae genomes. Our implementation returns results matching those of the original algorithm but in half the time and with one quarter of the memory footprint for scaffold building. In this study, we have laid the basis for multiple sequence alignment of large-scale datasets on a massively parallel, distributed memory supercomputer, thus enabling comparison of hundreds instead of a few genome sequences within reasonable time.

Manuscript received March 24, 2011. This research was supported by Victorian Life Sciences Computation Initiative (VLSCI) grant numbers VR0126 and VR0082 on its Peak Computing Facility at the University of Melbourne, an initiative of the Victorian Government.
P. C. Church is with Deakin University, Science and Technology (corresponding author; phone: +61 3 52271399, e-mail: pcc@deakin.edu.au).
A. Goscinski is with Deakin University, Science and Technology (e-mail: andrzej.goscinski@deakin.edu.au).
K. Holt is with the Dept. of Microbiology and Immunology at the University of Melbourne, Carlton, VIC, Australia (e-mail: kholt@unimelb.edu.au).
M. Inouye is with the Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia and with the Dept. of Medical Biology at the University of Melbourne, Parkville, VIC, Australia (e-mail: inouye@wehi.edu.au).
A. Ghoting and K. Makarychev are with the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA (e-mail: {aghoting; konstantin}@us.ibm.com).
M. Reumann is with the IBM Research Collaboratory for Life Sciences-Melbourne, Carlton, VIC, Australia and the Dept. of Computer Science and Software Engineering, University of Melbourne (e-mail: mreumann@ieee.org).

I. INTRODUCTION

COMPARATIVE genomics relies heavily on the alignment of multiple genomes. With recent advances in microbial diagnostics, typing and surveillance, comparative genomics is playing an increasingly important role in epidemiology, pathogen evolution and the fight against drug resistance. One way to characterize fine-scale genomic variation and confidently infer drug resistance mutations is through aligning and comparing many genomes. Therefore, there is a pressing need for computationally efficient algorithms and tools that can scale with current and future genomic datasets.

There are many methods of sequence alignment; common algorithms include ClustalW2 [2], MUSCLE [3], T-Coffee [4] and progressiveMauve [5]. The latter is a sequence aligner used predominantly for bacteria. It can be used to find genome rearrangements and aligns to a scaffold of conserved regions. Alignment makes use of a tree method similar to those of ClustalW2 and MUSCLE. The progressiveMauve algorithm is ideal for analyzing population diversity because genome rearrangements are taken into account during alignment. However, the current implementation is both computation and memory intensive.
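
The sorted k-mer lists named in the abstract can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the progressiveMauve or BG/P implementation: the function names `sorted_kmer_list` and `unique_shared_kmers` are hypothetical, contiguous k-mers on the forward strand are assumed, and the seed pattern, strand handling and memory layout of the original code are not reproduced.

```python
from itertools import groupby


def sorted_kmer_list(seq, k):
    """Return all (k-mer, start position) pairs of seq, sorted by k-mer.

    Hypothetical helper: forward strand and contiguous k-mers only.
    """
    seq = seq.upper()
    return sorted((seq[i:i + k], i) for i in range(len(seq) - k + 1))


def unique_shared_kmers(list_a, list_b):
    """Return (pos_a, pos_b, k-mer) for k-mers occurring exactly once in each list.

    Both inputs must be sorted k-mer lists, so that equal k-mers are
    adjacent and groupby can collapse each run of duplicates.
    """
    def unique_positions(kmer_list):
        unique = {}
        for kmer, run in groupby(kmer_list, key=lambda pair: pair[0]):
            positions = [pos for _, pos in run]
            if len(positions) == 1:  # discard k-mers repeated within a genome
                unique[kmer] = positions[0]
        return unique

    unique_a = unique_positions(list_a)
    unique_b = unique_positions(list_b)
    shared = unique_a.keys() & unique_b.keys()
    return sorted((unique_a[m], unique_b[m], m) for m in shared)


if __name__ == "__main__":
    genome_a = sorted_kmer_list("ACGTTGCA", 4)
    genome_b = sorted_kmer_list("TTACGTAG", 4)
    print(unique_shared_kmers(genome_a, genome_b))
```

Sorting places identical k-mers next to each other, so repeats within one genome can be discarded in a single pass. What remains are k-mers that occur once in each sequence, which may serve as candidate anchors for conserved regions. The sketch stores every k-mer as a Python string, so it says nothing about the memory savings reported in the abstract.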
It is common for alignment methods to trade speed against accuracy. Memory (RAM) requirements that grow exponentially with input data size are often a key constraint. When aligning large genomic populations, RAM can become a bottleneck. While virtual memory can be used, it often slows down computation. Thus, considering that genomic data will only increase in size and volume in the future, multiple sequence alignment will need high performance computing (HPC) systems.

To our knowledge, there is no scalable alignment method available that accurately measures genetic diversity. Further, most alignment algorithms are designed for shared memory computers. These will require terabytes of RAM to carry out multiple sequence alignment of hundreds of genomes. Designing an algorithm for massively parallel, distributed memory machines such as the IBM Blue Gene/P supercomputer (BG/P) will create opportunities for fast, high throughput multiple sequence alignment in the future. In this study we present our design and concept as well as preliminary results of a multiple sequence alignment algorithm for massively parallel, distributed memory HPC systems.

II. MATERIALS AND METHODS

We choose to parallelize the progressiveMauve algorithm because of its advantages in analyzing population diversity. In addition, comparing our results with the original output serves to validate our method. First, we analyze progressiveMauve with respect to regions that have to be carried out sequentially

Design of Multiple Sequence Alignment Algorithms on Parallel, Distributed Memory Supercomputers
Philip C. Church, Student Member, IEEE, Andrzej Goscinski, Kathryn Holt, Michael Inouye, Amol Ghoting, Konstantin Makarychev, and Matthias Reumann, Member, IEEE
33rd Annual International Conference of the IEEE EMBS, Boston, Massachusetts USA, August 30 - September 3, 2011
978-1-4244-4122-8/11/$26.00 ©2011 IEEE