PAGE: A Framework for Easy PArallelization of GEnomic Applications

Mucahid Kutlu
Department of Computer Science and Engineering, Ohio State University, Columbus, OH, 43210
Email: kutlu@cse.ohio-state.edu

Gagan Agrawal
Department of Computer Science and Engineering, Ohio State University, Columbus, OH, 43210
Email: agrawal@cse.ohio-state.edu

Abstract—With the availability of high-throughput and low-cost sequencing technologies, an increasing amount of genetic data is becoming available to researchers. There is clearly a potential for significant new scientific and medical advances through analysis of such data. However, it is imperative to exploit parallelism and use computing resources effectively in order to handle massive datasets. Thus, frameworks that help researchers develop parallel applications without dealing with the low-level details of parallel coding are very important for advances in genetic research. In this study, we develop a middleware, PAGE, which supports 'mapreduce-like' processing, but with significant differences from a system like Hadoop, to be useful and effective for parallelizing analysis of genomic data. In particular, it can work with map functions written in any language, thus allowing existing serial tools (even those for which only an executable is available) to be used as map functions. It can therefore greatly simplify parallel application development for scenarios where complex data formats and/or nuanced serial algorithms are involved, as is often the case for genomic data. It allows parallelization by partitioning by-locus or by-chromosome, and provides different scheduling schemes and execution models to match the nature of algorithms common in genetic research.
We have evaluated the middleware system using four popular genomic applications (VarScan, Unified Genotyper, Realigner Target Creator, and Indel Realigner) and compared the achieved performance against two popular frameworks (Hadoop and GATK). We show that our middleware outperforms GATK and Hadoop, and that it achieves high parallel efficiency and scalability.

I. INTRODUCTION

Analysis of genetic data is becoming increasingly critical for medical research and even practice. Trends in sequencing technologies have drastically reduced the cost of, and increased the speed of, collecting gene sequences [1], [6]. This data is now being shared aggressively through various projects, and researchers at different institutions can download and analyze the data. An example of such efforts is the 1000 Human Genome Project¹, which has already produced 200TB of genome data across 1700 samples and made it available on the Amazon Cloud Storage².

¹ www.1000genomes.org
² http://aws.amazon.com/1000genomes/

As a large number of researchers have access to the data, the focus is beginning to shift to analysis of this data. There exist many tools [15], [24], [30], [34] and libraries [4] to ease the implementation of new data analysis programs on genetic data. One of the complexities in this area is that the data formats in which sequences are stored are very specialized. Existing tools, in most cases, spare scientists from having to understand these formats and code complex algorithms.

Consistent with the overall trend in data analytics, use of parallelism is inevitable in analysis of genomic data as well. Lowered response time, including facilitating interactive analysis of large-scale data, can open up many novel opportunities for researchers. Again, just as in all other fields, the difficulty of carrying out parallel implementations, especially for domain experts, is a large obstacle to widespread use of parallelism for analysis of genomic data.
The current state of the art in parallel analysis of genomic data is very limited. There are many serial software suites that lack any parallelization capability [19], [23], [24], [26], [30]. General MapReduce frameworks such as Hadoop [36] can potentially be used, but even parallelization using such a framework can be a hard problem for the community [20], [29]. In particular, developing a parallel implementation with MapReduce requires knowledge of data formats and reprogramming of an existing serial program, neither of which is desirable. Load imbalance is another problem when processing genomic data, and it requires nuanced solutions. Genome Analysis Tool Kit (GATK) [27] provides developers with a MapReduce-like framework specific to analysis of genomic data. However, GATK does not provide distributed memory parallelization for most of its programs and its scalability is low, as we will show in Section VI. Thus, easing parallelization of applications that process genomic data, while also achieving high parallel efficiency, is a challenging problem.

In this study, we propose PAGE, a MapReduce-like middleware that allows users to continue to use their existing programs while achieving parallelization with modest additional effort. The map and reduce programs can be separate executables written in any programming language, and even in two distinct languages. PAGE can also work with any of the popular genomic data formats. The middleware divides the genome structure into regions/intervals and runs each map task on a different region, performing load balancing in the process. Subsequently, these partial results are reduced to obtain the final result(s). In order to improve efficiency, our middleware provides streaming task scheduling, in which tasks can be scheduled dynamically with the help of a master node, and the reduce tasks do not have to wait for all map tasks to finish.
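The processing flow described above can be illustrated with a minimal Python sketch. This is not PAGE's actual interface: the names partition_by_locus, map_task, and run are hypothetical, the thread pool stands in for the master node that hands out tasks dynamically, and the map task merely counts the bases in its interval where a real deployment would presumably launch an existing serial executable on that region.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List, Tuple

# (chromosome, start, end), 1-based inclusive coordinates (an assumption).
Interval = Tuple[str, int, int]


def partition_by_locus(chrom_lengths: Dict[str, int],
                       interval_size: int) -> List[Interval]:
    """Split every chromosome into fixed-size intervals (hypothetical helper)."""
    intervals: List[Interval] = []
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, interval_size):
            end = min(start + interval_size - 1, length)
            intervals.append((chrom, start, end))
    return intervals


def map_task(interval: Interval) -> int:
    """Stand-in map function: returns the number of bases in the interval.

    In a real setting this is where an external serial tool could be
    invoked on the region (for example via subprocess.run).
    """
    _, start, end = interval
    return end - start + 1


def run(chrom_lengths: Dict[str, int], interval_size: int,
        workers: int = 4) -> int:
    """Run map tasks dynamically and reduce partial results as they arrive."""
    intervals = partition_by_locus(chrom_lengths, interval_size)
    total = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Idle workers pull the next queued task, so slow regions do not
        # hold up the rest (a simple form of dynamic scheduling).
        futures = [pool.submit(map_task, iv) for iv in intervals]
        # Streaming reduce: fold in each partial result as soon as it is
        # ready instead of waiting for every map task to finish.
        for future in as_completed(futures):
            total += future.result()
    return total


if __name__ == "__main__":
    print(run({"chr1": 10, "chr2": 5}, interval_size=4))
```

The reduction here is a plain sum, which is commutative and associative, so the order in which partial results arrive does not matter. A reduction that depends on genomic order would need to buffer or sort partial results first.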
2014 IEEE 28th International Parallel & Distributed Processing Symposium. 1530-2075 2014 U.S. Government Work Not Protected by U.S. Copyright. DOI 10.1109/IPDPS.2014.19

We evaluate PAGE with 4 existing applications, which are