IEICE TRANS. FUNDAMENTALS, VOL.E99–A, NO.3 MARCH 2016 683

PAPER
Novel Reconfigurable Hardware Accelerator for Protein Sequence Alignment Using Smith-Waterman Algorithm

Atef IBRAHIM†,††a), Nonmember, Hamed ELSIMARY†, Member, and Abdullah ALJUMAH†, Nonmember

SUMMARY This paper presents a novel reconfigurable semi-systolic array architecture for the Smith-Waterman algorithm with an affine gap penalty, used to align protein sequences and optimized for shorter database sequences. The architecture has been modified to enable hardware reuse rather than replicating the processing elements of the semi-systolic array across multiple FPGAs. The proposed hardware architecture and the previously published conventional one are described at the Register Transfer Level (RTL) in VHDL and implemented using FPGA technology. The results show that the proposed design has a significantly higher normalized speedup (up to 125%) than the conventional one for query sequence lengths of less than 512 residues. According to the UniProtKB/TrEMBL protein database (release 2015_05) statistics, most sequences (about 80%) are shorter than 512 residues, so the proposed design outperforms the conventional one in terms of speed and area over this range of sequence lengths.
key words: semi-systolic arrays, bio-informatics, protein sequence alignment, Smith-Waterman alignment algorithm, biological computation, reconfigurable computing

1. Introduction

Protein alignment by dynamic programming based (DP-based) algorithms on general-purpose processors (microprocessors) has quadratic time complexity. Because of the exponential growth of biological databases, there has been an enormous increase in research focused on accelerating DP-based algorithms. These algorithms are accelerated on parallel architectures such as linear single instruction multiple data (SIMD) arrays and systolic arrays.
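To make the quadratic cost concrete, the following is a minimal software sketch of Smith-Waterman scoring with an affine gap penalty. It is not the hardware architecture described in this paper. The function name `sw_affine_score`, the match/mismatch scores, and the gap values are illustrative assumptions; a protein aligner would normally use a substitution matrix rather than a simple match/mismatch rule. A gap of length k is assumed to cost gap_open + (k - 1) * gap_extend.

```python
def sw_affine_score(query, target, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Return the best local alignment score under an affine gap penalty.

    Illustrative sketch only: scoring values are assumptions, and only the
    score is returned (no traceback).
    """
    neg_inf = float("-inf")
    n = len(target)
    # Previous row of H (best score of an alignment ending at that cell) and
    # of E (best score ending in a gap that skips a query residue).
    h_prev = [0] * (n + 1)
    e_prev = [neg_inf] * (n + 1)
    best = 0
    for q in query:                      # one row per query residue
        h_cur = [0] * (n + 1)
        e_cur = [neg_inf] * (n + 1)
        f = neg_inf                      # gap that skips target residues in this row
        for j in range(1, n + 1):        # one column per target residue
            # Open a new gap from H or extend an existing one.
            e_cur[j] = max(e_prev[j] - gap_extend, h_prev[j] - gap_open)
            f = max(f - gap_extend, h_cur[j - 1] - gap_open)
            diag = h_prev[j - 1] + (match if q == target[j - 1] else mismatch)
            # Local alignment: scores are clamped at zero.
            h_cur[j] = max(0, diag, e_cur[j], f)
            best = max(best, h_cur[j])
        h_prev, e_prev = h_cur, e_cur
    return best


print(sw_affine_score("ACGTACGT", "ACGTTACGT"))  # prints 13
```

The two nested loops visit every (query, target) cell once, which is the source of the quadratic running time. Each cell depends only on its left, upper, and upper-left neighbours, so all cells on one anti-diagonal can be computed at the same time; this is the parallelism that SIMD and systolic arrays exploit.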
Both architectures are good candidates for fine-grained parallel acceleration of sequence alignment with DP-based algorithms. Coarse-grained parallelism is another approach, where the computations of DP-based algorithms are distributed over multiprocessor clusters. Although coarse-grained parallelism significantly increases computation speed, such implementations consume significant amounts of energy and involve increased size, operational costs, and maintenance. On the other hand, fine-grained parallelism using systolic arrays has been implemented on both FPGA and ASIC platforms. The latter implements systolic arrays in a single-purpose chip and has provided relatively good area/speed ratios; however, the single-purpose hardware lacks the re-programmability that is important for sequence alignment. Over the last decades, reconfigurable FPGAs have become an important alternative to expensive, energy-hungry high-performance supercomputers and multiprocessor clusters. The remarkable speedup of FPGAs in accelerating bio-computing algorithms has led to such reconfigurable computing platforms being used as acceleration platforms for scientific computing.

Manuscript received May 20, 2015.
Manuscript revised August 26, 2015.
† The authors are with the Faculty of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Alkharj, KSA.
†† The author is also with the Electronics Research Institute, Cairo, Egypt.
a) E-mail: attif ali2002@yahoo.com
DOI: 10.1587/transfun.E99.A.683

Advances in IC technology over the last decade have led to the production of large FPGAs that can accommodate today's complex designs. This has led researchers to use FPGAs to implement the Smith-Waterman alignment algorithm with the affine gap penalty [1]. In 2003, the Human Genome Project (HGP) provided databases with massive numbers of biological sequences [2].
Because of the massive number of biological sequences, processing them requires a large amount of time and resources, which represents a real challenge for the available technology.

The implementation of the Smith-Waterman algorithm with an affine gap penalty on FPGA in [3] was among the early works reported in the literature. Yamaguchi et al. [4] in 2002 implemented the Smith-Waterman algorithm with an affine gap function on the RC 1000-PP Celoxica board with a Virtex-II FPGA. At that time, Virtex-II FPGAs were the latest available, and the Xilinx XCV2000E device fitted a maximum of 144 processing elements (PEs). Oliver et al. [3] presented run-time reconfiguration of the PEs in order to reuse resources. Jacobi et al. [5] realized a reconfigurable system to implement the algorithm on a Virtex-II FPGA board. Similar work was presented by VanCourt and Herbordt [6] and Hoang et al. [7]. Mohamed Abouellail et al. [8] presented PE systolic arrays on a cluster of multiple FPGAs to solve the problem of long queries. Another approach to handling long queries has also been reported in [9], [10], [11], [12], [13]. These works reuse the PEs in a technique known as folding. Examples of such an approach were reported by Xian-yang et al. [13]. Zhang et al. [14] presented a new implementation of the Smith-Waterman algorithm on reconfigurable FPGAs in a supercomputer fashion, redesigning the PEs to reduce the storage associated with each PE and to enable rapid access to the substitution matrix, which was stored in the PE. Yamaguchi et al. [15] presented a similar technique that stores the substitution matrix in the PE for multiple-pass computation.

Copyright © 2016 The Institute of Electronics, Information and Communication Engineers

Isa et al. [16] presented a systolic array with