IEICE TRANS. FUNDAMENTALS, VOL.E99–A, NO.3 MARCH 2016 683

PAPER

Novel Reconfigurable Hardware Accelerator for Protein Sequence Alignment Using Smith-Waterman Algorithm

Atef IBRAHIM†,††a), Nonmember, Hamed ELSIMARY†, Member, and Abdullah ALJUMAH†, Nonmember

SUMMARY This paper presents a novel reconfigurable semi-systolic array architecture for the Smith-Waterman algorithm with an affine gap penalty, used to align protein sequences and optimized for shorter database sequences. This architecture has been modified to enable hardware reuse rather than replicating processing elements of the semi-systolic array in multiple FPGAs. The proposed hardware architecture and the previously published conventional one are described at the Register Transfer Level (RTL) in VHDL and implemented using FPGA technology. The results show that the proposed design has a significantly higher normalized speedup (up to 125%) over the conventional one for query sequence lengths of less than 512 residues. According to the UniProtKB/TrEMBL protein database (release 2015_05) statistics, the majority of sequences (about 80%) are shorter than 512 residues, so the proposed design outperforms the conventional one in terms of speed and area in this range of sequence lengths.

key words: semi-systolic arrays, bio-informatics, protein sequence alignment, Smith-Waterman alignment algorithm, biological computation, reconfigurable computing

1. Introduction

Protein alignment by Dynamic Programming based (DP-based) algorithms on general-purpose processors (microprocessors) has quadratic time complexity. Because of the exponential growth of biological databases, there has been an enormous increase in research focused on accelerating DP-based algorithms. These algorithms are accelerated on parallel architectures such as linear single instruction multiple data (SIMD) arrays and systolic arrays.
Both architectures are good candidates as fine-grained parallel architectures for accelerating sequence alignment with DP-based algorithms. Coarse-grained parallelism is another approach, in which the computations of DP-based algorithms are distributed over multiprocessor clusters. Although coarse-grained parallelism significantly increases computation speed, such implementations consume significant amounts of energy and incur increased size, operational costs, and maintenance. On the other hand, fine-grained parallelism using systolic arrays has been implemented on both FPGA and ASIC platforms. The latter implements systolic arrays in a single-purpose chip and has provided relatively good area/speed ratios; however, the single-purpose hardware lacks the re-programmability that is important for sequence alignment. Over the last decades, reconfigurable FPGAs have become an important alternative to high-performance supercomputers and multiprocessor clusters, which are expensive and consume large amounts of energy. The remarkable speedup of FPGAs in accelerating bio-computing algorithms has led to such reconfigurable computing platforms being used as acceleration platforms for scientific computing.

Manuscript received May 20, 2015.
Manuscript revised August 26, 2015.
† The authors are with the Faculty of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Alkharj, KSA.
†† The author is also with the Electronics Research Institute, Cairo, Egypt.
a) E-mail: attif ali2002@yahoo.com
DOI: 10.1587/transfun.E99.A.683

Advances in IC technology over the last decade have led to the production of large FPGAs that can accommodate today's complex designs. This has led researchers to use FPGAs to implement the Smith-Waterman alignment algorithm with the affine gap penalty [1]. In 2003, the Human Genome Project (HGP) provided databases with massive numbers of biological sequences [2].
Because of their sheer number, these sequences require a large amount of time and resources to process, which poses a real challenge to the available technology.

The implementation of the Smith-Waterman algorithm with an affine gap penalty on an FPGA in [3] was among the early works reported in the literature. Yamaguchi et al. [4] in 2002 implemented the Smith-Waterman algorithm with an affine gap function on the RC 1000-PP Celoxica board with a Virtex-II FPGA. At that time, Virtex-II was the latest FPGA family, and the Xilinx XCV2000E device fitted a maximum of 144 processing elements. Oliver et al. [3] presented run-time reconfiguration of Processing Elements (PEs) in order to reuse resources. Jacobi et al. [5] realized a reconfigurable system to implement the algorithm on a Virtex-II FPGA board. Similar work was presented by VanCourt and Herbordt [6] and Hoang et al. [7]. Mohamed Abouellail et al. [8] presented PE systolic arrays on a cluster of multiple FPGAs to solve the problem of long queries.

Another approach to handling long queries has also been reported in [9], [10], [11], [12], [13]. These works reuse PEs through a technique known as folding; an example of this approach was reported by Xian-yang et al. [13].

Copyright © 2016 The Institute of Electronics, Information and Communication Engineers

Zhang et al. [14] presented a new implementation of the Smith-Waterman algorithm on reconfigurable FPGAs in a supercomputer fashion, redesigning the PEs to reduce the storage associated with each PE while keeping the substitution matrix stored in the PE for rapid access. Yamaguchi et al. [15] presented a similar technique that stores the substitution matrix in the PE for multiple-pass computation. Isa et al. [16] presented a systolic array with