Parallel Biological Sequence Comparison on Heterogeneous High Performance Computing Platforms with BSP++

Khaled Hamidouche, Universite Paris-Sud, Orsay, France, hamidou@lri.fr
Fernando M. Mendonca, University of Brasilia, ICC Norte, Brasilia, Brazil, foxdie86@gmail.com
Joel Falcou, Universite Paris-Sud, Orsay, France, joel.falcou@lri.fr
Daniel Etiemble, Universite Paris-Sud, Orsay, France, de@lri.fr

Abstract

Biological Sequence Comparison is an important operation in Bioinformatics that is often used to relate organisms. Smith and Waterman proposed an exact algorithm (SW) that compares two sequences in quadratic time and space. Due to its high computing and memory requirements, SW is usually executed on HPC platforms such as multicore clusters and CellBEs. Since HPC architectures exhibit very different hardware characteristics, porting an application between them is an error-prone, time-consuming task. BSP++ is an implementation of BSP that aims to reduce the effort of writing parallel code. In this paper, we propose and evaluate a parallel BSP++ strategy to execute SW on multiple platforms such as MPI, OpenMP, MPI/OpenMP, CellBE and MPI/CellBE. The results obtained with real DNA sequences show that the performance of our versions is comparable to that of the versions in the literature, evidencing the appropriateness and flexibility of our approach.

1. Introduction

Once a new biological sequence is discovered, its functional/structural characteristics must be established. In order to do that, the newly discovered sequence is compared against other sequences, looking for similarities. Sequence comparison is, therefore, one of the most basic operations in Bioinformatics. The most accurate algorithm to execute pairwise comparisons is the one proposed by Smith and Waterman (SW) [21], which is based on dynamic programming and has quadratic time and space complexity.
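To make the quadratic cost concrete, the following is a minimal sequential sketch of the SW recurrence, not the BSP++ implementation evaluated in this paper. It assumes a linear gap penalty, and the `Scoring` values and the name `sw_score` are hypothetical choices made for illustration only. The function fills an (m+1) x (n+1) similarity matrix and returns the best local alignment score.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical scoring parameters (illustrative, not taken from the paper).
struct Scoring {
    int match = 1;      // reward for two identical symbols
    int mismatch = -1;  // penalty for two different symbols
    int gap = -2;       // linear penalty per inserted or deleted symbol
};

// Returns the best local alignment score between a and b.
// Time and space are both O(m * n), which is the quadratic cost of SW.
int sw_score(const std::string& a, const std::string& b,
             const Scoring& s = Scoring{}) {
    // Sanity check on the illustrative parameters.
    assert(s.match > 0 && s.mismatch <= 0 && s.gap <= 0);

    const std::size_t m = a.size();
    const std::size_t n = b.size();

    // H[i][j]: best score of a local alignment ending at a[i-1], b[j-1].
    // Row 0 and column 0 stay at zero.
    std::vector<std::vector<int>> H(m + 1, std::vector<int>(n + 1, 0));

    int best = 0;
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            const int diag = H[i - 1][j - 1] +
                             (a[i - 1] == b[j - 1] ? s.match : s.mismatch);
            const int up = H[i - 1][j] + s.gap;    // gap in b
            const int left = H[i][j - 1] + s.gap;  // gap in a
            // The zero lets a local alignment restart anywhere.
            H[i][j] = std::max({0, diag, up, left});
            best = std::max(best, H[i][j]);
        }
    }
    return best;
}
```

Each cell depends only on its upper, left and upper-left neighbours, so cells on the same anti-diagonal are independent of each other; this dependency pattern is what parallel SW strategies typically exploit.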
This quadratic cost can easily lead to extremely high execution times and huge memory requirements, since biological databases are growing exponentially.

Parallel processing can significantly reduce the time needed to obtain results with the SW algorithm. Indeed, many proposals exist to execute SW on clusters [5] [17] [7] and grids [23]. More recently, accelerators such as the CellBE and GPUs have been explored to execute SW [26] [2] [19] [13]. Nevertheless, most of the proposals in the literature were developed for a particular target architecture, using specific optimization techniques. In this scenario, programmers need to study the programming interface of each target platform and reprogram the entire application. For this reason, parallel programming languages, tools and cross-compilers that are able to generate code for multiple platforms have been proposed, such as [9], [3], [12], [8] and [10].

Following this trend, we proposed BSP++ [10], a lightweight object-oriented library that implements the Bulk Synchronous Parallel (BSP) model. By providing a hierarchical parallel view of the BSP model and a few primitives that can be invoked from standard C++ programs, it can be compiled for multiple parallel architectures while requiring minimal code rewriting. In this paper, we propose a parallel BSP implementation of the SW algorithm using BSP++, in five parallel versions (MPI, OpenMP, MPI+OpenMP, CellBE and MPI+CellBE) that were executed on multiple HPC platforms.

Executing our BSP++ versions on platforms with up to 128 cores, we show that all versions deliver very good performance, measured in GCUPS (billions of cell updates per second). For instance, when comparing real sequences of 1 MBP (million base pairs) on a 128-core cluster with the MPI+OpenMP version, the execution time is reduced from 3 h 27 min (one core) to less than 2 min (128 cores), yielding a throughput of 10.41 GCUPS.
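As a rough check of how these figures relate, GCUPS can be computed as the number of similarity-matrix cells divided by the execution time in seconds and by 10^9. The helper below is a hypothetical sketch (the name `gcups` is ours), and it assumes both sequences have exactly 10^6 bases, which is only an approximation of the real sequence sizes.

```cpp
#include <cassert>

// Billions of cell updates per second for an m x n similarity matrix
// computed in the given number of seconds. Hypothetical helper for
// illustration; m and n are the sequence lengths in bases.
double gcups(double m, double n, double seconds) {
    assert(seconds > 0.0);
    return (m * n) / (seconds * 1e9);
}
```

Under that assumption, a 10^6 x 10^6 comparison contains 10^12 cells, so 10.41 GCUPS corresponds to roughly 96 s, which is consistent with an execution time below 2 min. The one-core time of 3 h 27 min (12420 s) would correspond to about 0.08 GCUPS.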
Moreover, we show that our CellBE SW versions are able to compare larger sequences than previous CellBE proposals. The remainder of this paper is organized as follows.