Optimizing a MapReduce Module of Preprocessing High-Throughput DNA Sequencing Data

Wei-Chun Chung #,1,3, Yu-Jung Chang #,2, Chien-Chih Chen 2, Der-Tsai Lee 2,3,4, Jan-Ming Ho §,1,2

1 Research Center for Information Technology Innovation
2 Institute of Information Science
Academia Sinica, Taipei, Taiwan, ROC
E-mail: {wcchung, yjchang, rocky, dtlee, hoho}@iis.sinica.edu.tw

Abstract—The MapReduce framework has become the de facto choice for big data analysis in a variety of applications. In the MapReduce programming model, computation is distributed across a cluster of computing nodes that run in parallel. The performance of a MapReduce application is thus affected by the system and middleware, the characteristics of the data, and the design and implementation of the algorithms. In this study, we focus on performance optimization of a MapReduce application, i.e., CloudRS, which tackles the problem of detecting and removing errors in next-generation sequencing de novo genomic data. We present three strategies, i.e., content-exchange, content-grouping, and index-only strategies, of communication between the Map() and Reduce() functions. The three strategies differ in the way messages are exchanged between the two functions. We also present experimental results to compare the performance of the three strategies.

Keywords-error correction; genome assembly; mapreduce; next-generation sequencing; optimization

I. INTRODUCTION

MapReduce [1] is a prominent distributed computational framework that offers key features for large-scale data processing on the cloud [2-4], including fault tolerance, scheduling, data replication, load balancing, and parallelization. By virtue of its scalability and ease of development, MapReduce and its implementations [5-7] have been widely used in different applications, e.g., Web and social network analysis, scientific emulation, financial and business data processing, and bioinformatics [8-12].
However, the performance and efficiency of MapReduce are affected by many factors, which makes optimization challenging. Optimizing MapReduce is essential as processing data in a timely and cost-efficient manner becomes critical [13-18]. Fortunately, various techniques have been introduced to improve the performance of MapReduce [19-25], including hardware-, software-, and framework-level optimization. One technique is to tune the parameters of the system, middleware, and MapReduce execution by using expert systems [20-22] or rule-of-thumb policies [26, 27]. Another type of optimization focuses on the design of the algorithm or the characteristics of the application's data [28, 29].

In this study, we focus on CloudRS [9], a MapReduce application for correcting errors in next-generation sequencing (NGS) data. As the cost of DNA sequencing falls rapidly [12], the accompanying growth of genome data results in unpredictable execution time, even when the data is processed by MapReduce. Thus, to optimize the performance of CloudRS, we evaluate three message generation and transmission approaches that aim to reduce the communication cost of MapReduce: the content-exchange, content-grouping, and index-only strategies. We also present experimental results and discuss the observations and limitations of the proposed strategies.

II. BACKGROUND

A. The MapReduce programming model

The MapReduce programming model is composed of two primitive functions, Map and Reduce. The input data of a MapReduce program is a list of <key, value> pairs. The Map() function is applied to each pair and generates a set of intermediate <key, value> pairs, which the framework groups by key into pairs of the form <key, list(value)>. The Reduce() function is then applied to each grouped intermediate pair, processes the values of the list, and produces aggregated final results. Moreover, the MapReduce execution model includes additional functions, e.g., shuffle and sort, to handle intermediate data.
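The flow from Map() through grouping by key to Reduce() can be illustrated with a minimal, single-process sketch. It is not the CloudRS implementation and does not use any MapReduce framework: the function names (`map_kmers`, `shuffle_and_sort`, `reduce_count`, `run_mapreduce`), the k-mer counting task, and the toy reads are all assumptions chosen only to make the <key, value> and <key, list(value)> pairs concrete.

```python
from collections import defaultdict

def map_kmers(read_id, sequence, k=3):
    """Map(): emit one <k-mer, 1> pair per k-mer in the read (hypothetical task)."""
    return [(sequence[i:i + k], 1) for i in range(len(sequence) - k + 1)]

def shuffle_and_sort(pairs):
    """Group intermediate pairs by key, then order the groups by key.

    In a real cluster this exchange happens over the network between
    Map and Reduce tasks; here it is emulated with an in-memory dict.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())  # list of <key, list(value)> pairs

def reduce_count(key, values):
    """Reduce(): aggregate the list of values for one key."""
    return key, sum(values)

def run_mapreduce(records, k=3):
    """Apply Map() to every input pair, group by key, then apply Reduce()."""
    intermediate = []
    for read_id, sequence in records:
        intermediate.extend(map_kmers(read_id, sequence, k))
    return [reduce_count(key, values)
            for key, values in shuffle_and_sort(intermediate)]

# Toy input: <read id, sequence> pairs (made-up data).
reads = [("r1", "ACGTA"), ("r2", "CGTAC")]
result = run_mapreduce(reads)
print(result)  # [('ACG', 1), ('CGT', 2), ('GTA', 2), ('TAC', 1)]
```

In this sketch every intermediate pair crosses the boundary between the Map and Reduce steps, which is the communication cost that message-exchange strategies try to reduce.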
The shuffle function is applied on the Map side and exchanges data by key after Map(), so that all data with the same key is transmitted to a single Reduce() function. The sort function is launched on the Reduce side after the data exchange. It sorts data by the key field to group all pairs with the same key for further processing.

B. The CloudRS algorithm

The CloudRS algorithm [9] is implemented as multiple MapReduce rounds. It aims to correct sequencing errors conservatively to avoid false decisions, and thus improves the quality of de novo assembly. To correct a possible mismatch, CloudRS emulates read alignment and majority voting for each set of reads, denoted as a read stack,

───────────────────
# These authors contributed equally to this work (co-first authors).
3 Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, ROC.
4 Department of Computer Science and Information Engineering, National Chung Hsing University, Taichung, Taiwan, ROC.
§ Corresponding author.
2013 IEEE International Conference on Big Data
978-1-4799-1293-3/13/$31.00 ©2013 IEEE