RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Data

Mucahid Kutlu
Department of Computer Science and Engineering, Ohio State University, Columbus, OH 43210
Email: kutlu@cse.ohio-state.edu

Gagan Agrawal
Department of Computer Science and Engineering, Ohio State University, Columbus, OH 43210
Email: agrawal@cse.ohio-state.edu

Abstract—As the development of high-throughput and low-cost sequencing technologies leads to massive volumes of genomic data, new solutions for handling data-intensive applications on parallel platforms are urgently required. In particular, the nature of the processing leads to both load balancing and I/O contention challenges. In this paper, we develop a novel middleware system, RE-PAGE, which allows parallelization of applications that process genomic data with a simple, high-level API. To address load balancing and I/O contention, the features of the middleware include: 1) use of domain-specific information in the formation of data chunks (which can be of non-uniform sizes), 2) intelligent replication and placement of each chunk on a small number of nodes, and 3) scheduling schemes for achieving load balance when data movement costs outweigh processing costs and the chunks are of non-uniform sizes. We evaluate our framework using three genomic applications: VarScan, Unified Genotyper, and Coverage Analyzer. We show that our approach leads to better performance than conventional MapReduce scheduling approaches and than systems that access data from a centralized store. We also compare against the popular frameworks Hadoop and GATK, and show that our middleware outperforms both, achieving high parallel efficiency and scalability.

I. INTRODUCTION

As the amount of available genomic data increases, analysis of such data is becoming increasingly important for medical research and even practice.
With the development of new sequencing and alignment technologies, genetic data can be obtained at a much faster rate and at a lower cost ([1], [2]). There is a growing interest in the analysis of genomic data (for example, see genome.ucsc.edu and http://www.ncbi.nlm.nih.gov/genome). One noteworthy project is the 1000 Human Genome Project (www.1000genomes.org), which has already produced 200 TB of genomic data across 1700 samples and made it available on Amazon Cloud Storage (http://aws.amazon.com/1000genomes/). Exploiting parallelism and effectively utilizing computing resources has become inevitable for analyzing this data efficiently. The current state of the art in parallel analysis of genomic data is very limited. Many studies have reported implementations of a specific algorithm ([9], [29], [6], [28], [35]). General MapReduce frameworks such as Hadoop [34] can potentially be used, but their limited programmability does not allow us to take advantage of the characteristics of genomic data and algorithms to increase the performance of parallel executions. The Genome Analysis Tool Kit (GATK) [22] is a popular framework that provides a platform for users to develop parallel genomic applications easily. However, as has been shown, it has significant performance limitations [15]. PAGE [15] is a middleware that parallelizes executables of genomic applications, providing an easy way of parallelizing a wide range of genomic applications. However, nodes access the data through a network file system, and therefore the applications suffer from I/O contention. In this paper, we focus on both the performance and programmability challenges associated with processing genomic data. Popular data-intensive middleware systems such as Hadoop are based on dividing data into chunks of uniform size. We show why this is not desirable for genomic data, and present a chunk formation scheme that uses domain information.
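The paper only names the domain-aware chunk formation scheme at this point, but the contrast with uniform-size chunking can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's actual algorithm: instead of cutting a file into fixed-size byte ranges, it groups position-sorted reads by genomic region, so chunk sizes naturally vary with coverage. The function name and region size are assumptions.

```python
# Hypothetical sketch of domain-aware chunk formation: rather than
# splitting a file into fixed-size byte ranges (as HDFS does), split a
# position-sorted list of aligned reads at genomic-region boundaries,
# so each (non-uniform) chunk holds all reads of a contiguous region.
def form_chunks(reads, region_size=1_000_000):
    """reads: (start_position, record) tuples, sorted by position.
    Chunk i holds every read whose start falls in region
    [i*region_size, (i+1)*region_size). Chunk sizes vary with
    coverage, which uniform-size chunking cannot express."""
    chunks = {}
    for pos, record in reads:
        chunks.setdefault(pos // region_size, []).append(record)
    return [chunks[k] for k in sorted(chunks)]
```

Because a chunk boundary never falls inside a genomic region, a downstream tool can process each chunk independently without stitching partial records back together.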
Next, we address the problem of load balancing the computation when chunks are of non-uniform sizes and the cost of data movement outweighs processing times. We present a data replication scheme and a set of task scheduling methods that address the load balancing problem without using remote data. We avoid data movement across nodes and accesses to a central file system, and we do not require full replication of the input data. The three new scheduling schemes we have designed are largest chunk first, busiest node first, and process memory-resident data. To address the programmability challenges, these techniques are implemented in a middleware system, RE-PAGE, which eases parallelization of data-intensive genomic applications while achieving high parallel efficiency. Specifically, RE-PAGE is a MapReduce-like middleware system in which the map and reduce programs are separate executables written in any programming language. Thus, RE-PAGE allows users to continue to utilize their existing genomic tools, while achieving parallelization with modest additional effort. We evaluate RE-PAGE with three applications: VarScan [13], Unified Genotyper [3], and a coverage analyzer, which is a modified version of the depth feature of SamTools [19]. We compare our scheduling approach against a conventional MapReduce scheduling scheme and show that our approach performs better. We also achieve significant improvements over implementations where all nodes access data from a centralized server. Finally, we compare the performance of RE-PAGE against GATK and Hadoop and show that RE-PAGE outperforms both.

2015 IEEE International Conference on Cluster Computing, 978-1-4673-6598-7, 2015 U.S. Government Work Not Protected by U.S. Copyright, DOI 10.1109/CLUSTER.2015.54
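The largest-chunk-first idea described above can be sketched as a simple greedy assignment, under the paper's constraint that a task may run only on a node that already holds a replica of its chunk (no remote reads). This is a minimal sketch of the general idea, not the paper's implementation; the function and variable names are assumptions.

```python
# Hypothetical sketch of "largest chunk first" scheduling: each chunk is
# replicated on a small set of nodes, and a chunk may only be processed
# where its data resides. Assigning the largest chunks first leaves the
# small chunks to even out residual load imbalance at the end.
def schedule(chunk_sizes, replicas):
    """chunk_sizes: {chunk_id: size}; replicas: {chunk_id: [node, ...]}.
    Returns ({node: total_assigned_size}, {chunk_id: node})."""
    load = {n: 0 for nodes in replicas.values() for n in nodes}
    assignment = {}
    # Largest chunk first ...
    for cid in sorted(chunk_sizes, key=chunk_sizes.get, reverse=True):
        # ... placed on the least-loaded node holding a replica of it.
        target = min(replicas[cid], key=lambda n: load[n])
        assignment[cid] = target
        load[target] += chunk_sizes[cid]
    return load, assignment
```

Note how the replica sets bound the scheduler's choices: with full replication this degenerates to classic greedy longest-processing-time scheduling, while with a single replica per chunk there is no choice at all, which is why the placement of replicas matters as much as the scheduling order.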
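The executable-based programming model described above, in which map and reduce programs are arbitrary command-line tools, can be illustrated with a small driver sketch. This is an assumed illustration of the general approach, not RE-PAGE's API: the function name, the `*.chunk` file convention, and the tool invocation shape are all hypothetical.

```python
import glob
import os
import subprocess

# Hypothetical sketch of an executable-based map stage: the "map"
# program is an unmodified command-line genomic tool, run by the
# middleware once per locally stored chunk. Because the interface is
# just files and argv, the tool can be written in any language and
# needs no changes to be parallelized.
def run_local_maps(chunk_dir, tool_cmd):
    """tool_cmd: argv prefix of the user's executable; it is invoked
    as tool_cmd + [input_chunk, output_file] for each local chunk."""
    outputs = []
    for chunk in sorted(glob.glob(os.path.join(chunk_dir, "*.chunk"))):
        out = chunk + ".out"
        subprocess.run(tool_cmd + [chunk, out], check=True)
        outputs.append(out)
    return outputs
```

A separate reduce executable would then merge the per-chunk output files; the key point is that neither stage requires linking against the middleware.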