Design and Implementation of a Database Filter for BLAST Acceleration
Panagiotis Afratis, Constantinos Galanakis, Euripides Sotiriades, Georgios-Grigorios Mplemenos,
Grigorios Chrysos, Ioannis Papaefstathiou, Dionisios Pnevmatikatos¹
Department of Electronic and Computer Engineering,
Technical University of Crete, Chania, GR 73100, Greece
{afratis,esot,chrysos, pnevmati}@mhl.tuc.gr
Abstract — BLAST is a very popular Computational
Biology algorithm. Since it is computationally expensive, it
is a natural target for acceleration research, and many
reconfigurable architectures have been proposed offering
significant improvements.
In this paper we take a different approach to the same
problem: we propose a BLAST algorithm
preprocessor that efficiently identifies the portions of the
database that must be processed by the full algorithm in
order to find the complete set of desired results. We show
that this preprocessing is feasible and quick, and requires
minimal FPGA resources, while achieving a significant
reduction in the size of the database that needs to be
processed by BLAST. We also determine the parameters
under which prefiltering is guaranteed to identify the same
set of solutions as the original NCBI software.
We model our preprocessor in VHDL and implement it
on a reconfigurable architecture. To evaluate its
performance, we use a large set of datasets and compare
against the original (NCBI) software. Prefiltering is able to
determine that between 80% and 99.9% of the database will
not produce matches and can be safely ignored. Processing
only the remaining portions using software such as
NCBI-BLAST reduces execution time by a factor of 3 to 15.
Since our prefiltering
technique is generic, it can be combined with any other
software or reconfigurable acceleration technique.
I. INTRODUCTION
BLAST is considered the most popular and widely used
algorithm in Computational Biology. It is used for searching
large genetic databases in order to find areas of high similarity
(matches) between the database and an input query. It can also
be used as a component of other applications, such as
bioinformatics algorithms.
Since the algorithm is inherently computationally intensive,
building FPGA-based systems to boost its performance has been
a challenge for several research groups during the last few
years. RC-BLAST [1] was the first effort to implement this
algorithm on an FPGA. Boston University [2], the FPGA/FLASH
project [3], TUC BLAST [4], Mercury BLAST [5-6], and BEE
BLAST [7] followed, offering significant results. These projects
share common characteristics, but each one contributes a new
architecture with its own strengths. In
this paper we propose a Prefiltering architecture that does not
directly implement the BLAST algorithm, but exploits its
characteristics to tag the portions of the database that are
deemed “interesting” for the particular query. These are the
portions that we should explore in detail, since they are likely
(but not certain) to contain areas of high similarity (BLAST
matches). A threshold value determines the selectivity of the
filter. After the tagging,
any BLAST algorithm implementation (software or hardware)
can be used on the identified database subset to determine the
full results. We compare the results of our prefiltering against
NCBI BLAST results with several datasets and queries, and
we show that for a threshold value of 2 our solution is lossless
(i.e. it produces all the results BLAST reports). We also find
that for very large queries prefiltering is not effective, and we
address these cases by partitioning the query into smaller pieces
and processing them in parallel to achieve the correct overall
results and better filtering behavior. Search space reduction is
80% in the worst case, and up to 99.9% in some of our
experiments. Our prefiltering approach is versatile and can be
used in combination with any other hardware or software
BLAST implementation.
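To relate the search space reduction to the achievable speedup, the following back-of-the-envelope bound may help. It is an idealized model of our own, not a measurement: it assumes that the time of the full BLAST search scales linearly with the amount of database it processes. The symbols T, T_p, f and S are introduced here for illustration only.

```latex
% Idealized model (an assumption, not a measurement from this work):
%   T   = time of the full BLAST search over the whole database
%   T_p = time spent in the prefilter
%   f   = fraction of the database discarded by the prefilter
%   S   = resulting speedup
\[
  S \;=\; \frac{T}{T_p + (1-f)\,T} \;\le\; \frac{1}{1-f},
  \qquad
  f = 0.80 \;\Rightarrow\; S \le 5,
  \qquad
  f = 0.999 \;\Rightarrow\; S \le 1000 .
\]
```

Under this model, any measured speedup falls below the bound by an amount that depends on the prefiltering time T_p and on overheads the model ignores.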
We show the versatility and efficiency of our approach
using a parallel BLASTn algorithm that has been implemented
on M.PL.EM (Multiprocessor Platform for Embedded
systems) [10]. M.PL.EM is an FPGA-based multiprocessor
consisting of a large number of Xilinx MicroBlaze soft-cores
together with a hierarchical interconnection scheme and a
sophisticated memory subsystem. We use this platform to
post-process the results of our prefiltering system in order to
speed up the execution of the BLASTn algorithm on the
M.PL.EM system.
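The two-stage flow described above (prefilter first, then run a full search only on the tagged portions) can be illustrated with a minimal software sketch. This is not the hardware design presented in this paper. The tagging rule used here (count exact 12-character matches that start inside a fixed-size block and tag the block when the count reaches the threshold), the block size of 64, and the names `wmers`, `tag_blocks` and `search_tagged` are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Set, Tuple

W = 12  # substring width in characters (illustrative default)


def wmers(query: str, w: int = W) -> Set[str]:
    """All overlapping length-w substrings of the query."""
    return {query[i:i + w] for i in range(len(query) - w + 1)}


def tag_blocks(database: str, query: str, block: int = 64,
               threshold: int = 2, w: int = W) -> List[Tuple[int, int]]:
    """Return [start, end) ranges of blocks with at least `threshold` hits.

    Hypothetical rule: a hit is an exact length-w match that starts inside
    the block; block size and counting rule are illustrative assumptions.
    """
    words = wmers(query, w)
    n = len(database)
    tagged = []
    for start in range(0, n, block):
        end = min(start + block, n)
        last = min(end, n - w + 1)  # a window must fit inside the database
        hits = sum(database[i:i + w] in words for i in range(start, last))
        if hits >= threshold:
            tagged.append((start, end))
    return tagged


def search_tagged(database: str, tagged: List[Tuple[int, int]],
                  full_search: Callable[[str], object]) -> list:
    """Run any full search (a stand-in for BLAST) on the tagged ranges only."""
    return [full_search(database[s:e]) for s, e in tagged]


if __name__ == "__main__":
    query = "ACGTACGTACGTTTGA"
    db = "G" * 64 + query + "C" * 48
    tagged = tag_blocks(db, query)
    kept = sum(e - s for s, e in tagged)
    print(tagged, "discarded fraction:", 1 - kept / len(db))
```

In this toy setup `full_search` can be any callable, which mirrors the point that the tagged subset can be handed to any software or hardware BLAST implementation. A real system would presumably also pad the tagged ranges so that alignments can extend across block boundaries; the sketch omits this.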
In the next section we discuss the BLAST behaviour and
characteristics that lead us to our prefiltering approach.
Section 3 expands the description and evaluates the potential
of BLAST prefiltering. Sections 4 and 5 describe the architecture
of the proposed system and present performance
measurements. Finally, Section 6 presents conclusions and
future work.
II. PREFILTERING FOR THE BLAST ALGORITHM
The BLAST algorithm takes as input a query and a
database of genetic data. Its operation consists of three steps:
(1) the first step of the algorithm is a preprocessing step that
breaks the query into w-mers, that is, substrings 12 characters
wide (a character is a 2-bit value), (2) in the second
step the database is searched in order to find an exact match
(hit) between any part of the database and any of the w-mers,
¹ Dionisios Pnevmatikatos is also with FORTH-ICS.
978-3-9810801-5-5/DATE09 © 2009 EDAA