Using GPUs for the Exact Alignment of Short-Read Genetic Sequences by Means of the Burrows-Wheeler Transform

José Salavert Torres, Ignacio Blanquer Espert, Andrés Tomás Domínguez, Vicente Hernández García, Ignacio Medina Castelló, Joaquín Tárraga Giménez, and Joaquín Dopazo Blázquez

Abstract—General Purpose Graphics Processing Units (GPGPUs) constitute an inexpensive resource for computing-intensive applications that can exploit intrinsic fine-grain parallelism. This paper presents the design and implementation on GPGPUs of an exact alignment tool for nucleotide sequences based on the Burrows-Wheeler Transform. We compare it with state-of-the-art implementations of the same algorithm on standard CPUs, under the same I/O conditions. Excluding disk transfers, the GPU implementation shows a speedup greater than 12 with respect to CPU execution. The implementation exploits parallelism by concurrently searching different sequences on the same reference search tree, maximizing memory locality and ensuring symmetric access to the data. The paper describes the behavior of the algorithm on the GPU, which shows good performance scalability, limited only by the size of the GPU's internal memory.

Index Terms—Short-read alignment, CUDA, GPU, Burrows-Wheeler Transform.

1 INTRODUCTION

Nowadays, the falling cost of new sequencing technologies has made it feasible to work with far more data than before. A single biological experiment run on a current DNA sequencing machine can easily produce hundreds of gigabytes or even terabytes of data, and this volume is forecast to keep growing as DNA sequencers are continuously enhanced. This well-known "data deluge" confronts bioinformaticians with the challenge of processing and analyzing this avalanche of data [1].
However, efficiently transferring genomic data constitutes a challenge that reduces the advantages of e-science computing infrastructures, such as grids, clouds, or supercomputers. Therefore, the availability of inexpensive local resources is a great opportunity for tackling such problems.

A central task in these disciplines is mapping the sequences obtained from current DNA sequencers onto the reference genome of an organism. These sequences are called reads. In an experiment, there can be as many as 10^9 reads with an average length of 100 nucleotides. We therefore need an effective solution for mapping billions of short reads.

Mapping tools available in the literature follow different approaches. Smith-Waterman [2], [3] and BLAST [4], [5] are based on the implicit creation of a matrix of weights that defines the maximum likelihood between a reference string and the reads. Working with a matrix of weights is efficient when searching for homologies and is typically used for dissimilar long sequences.

Conversely, these options are not effective for aligning short reads. State-of-the-art short-read alignment methods create either hash tables or search trees that speed up searching. One interesting approach employs the Burrows-Wheeler Transform (BWT) [6] to create a tree that reduces the computational complexity of the alignment process to the order of the length of the read. The BWT is an indexing method originally used in data compression techniques. The BWT process consists mainly of sorting all possible rotations of the reference and keeping the original positions of the rotations. In this way, suffixes are grouped together, which speeds up later searches.

In any of the above cases, alignment is an intensive task that requires high-capacity computing resources. Among the available choices that provide faster computation models, General Purpose Graphics Processing Units (GPGPUs) are a very cost-effective option.
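As an illustration of the rotation-sorting idea described above, the following Python sketch builds a BWT index for a tiny reference and then finds exact matches with a backward search whose cost is proportional to the read length. It is a minimal sketch for exposition only, not the implementation presented in this paper: the function names (`sorted_rotation_starts`, `build_index`, `exact_match`) are hypothetical, the `$` terminator is an assumed convention, and sorting explicit rotations is only practical for toy inputs, not for genome-scale references.

```python
# Illustrative sketch only; all names below are hypothetical, not from the paper.

def sorted_rotation_starts(text):
    """Return the start positions of all rotations of `text`, in sorted order.

    `text` must end with a unique terminator '$' that sorts before A, C, G, T.
    """
    return sorted(range(len(text)), key=lambda i: text[i:] + text[:i])


def bwt(text, starts):
    """Last column of the sorted rotations: the character preceding each start."""
    return "".join(text[i - 1] for i in starts)


def build_index(reference):
    """Build the tables needed for backward search over `reference`."""
    text = reference + "$"  # assumed terminator convention
    starts = sorted_rotation_starts(text)
    last = bwt(text, starts)

    # smaller[c]: number of characters in text strictly smaller than c.
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    smaller, total = {}, 0
    for ch in sorted(counts):
        smaller[ch] = total
        total += counts[ch]

    # occ[c][i]: number of occurrences of c in last[:i].
    occ = {ch: [0] * (len(last) + 1) for ch in counts}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)

    return starts, smaller, occ


def exact_match(read, index):
    """Return the sorted positions where `read` occurs exactly in the reference."""
    starts, smaller, occ = index
    lo, hi = 0, len(starts)  # half-open interval of matching rotations
    for ch in reversed(read):  # one step per nucleotide of the read
        if ch not in smaller:
            return []
        lo = smaller[ch] + occ[ch][lo]
        hi = smaller[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(starts[lo:hi])


if __name__ == "__main__":
    index = build_index("ACGTACGT")
    print(exact_match("ACG", index))  # [0, 4]
    print(exact_match("TT", index))   # []
```

Each read is searched independently against the same read-only index, which is the property that makes it natural to process many reads concurrently.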
The GPGPU architecture enables these graphics devices to tackle general purpose problems efficiently, exploiting a micro-grain parallelism that is strongly bound by the memory hierarchy.

In this paper, we design and implement on GPGPUs, using CUDA [7], an exact alignment algorithm based on the BWT. Results reveal a very promising speedup with respect to the fastest available solutions on CPU (Bowtie [8], SOAP2 [9], and BWA [6]) and GPU (SOAP3 [10]) under the same conditions.

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 9, NO. 4, JULY/AUGUST 2012

• J. Salavert Torres, I. Blanquer Espert, A. Tomás Domínguez, and V. Hernández García are with the Centro Mixto CSIC, Instituto de Instrumentación para Imagen Molecular Valencia (I3M), Universitat Politècnica de València-CIEMAT, Camino de Vera s/n, València 46022, Spain. E-mail: josator@i3m.upv.es, {iblanque, atomas, vhernand}@dsic.upv.es.
• I. Medina Castelló, J. Tárraga Giménez, and J. Dopazo Blázquez are with the Bioinformatics and Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), and the Functional Genomics Node, INB, CIPF, Avda. Autopista del Saler, 16, Valencia 46012, Spain. E-mail: {imedina, jtarraga, jdopazo}@cipf.es.

Manuscript received 4 July 2011; revised 16 Jan. 2012; accepted 28 Feb. 2012; published online 20 Mar. 2012. For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-2001-07-0174. Digital Object Identifier no. 10.1109/TCBB.2012.49.

1545-5963/12/$31.00 © 2012 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.