Optimizing a MapReduce Module of Preprocessing High-Throughput DNA
Sequencing Data
Wei-Chun Chung (#,1,3), Yu-Jung Chang (#,2), Chien-Chih Chen (2),
Der-Tsai Lee (2,3,4), Jan-Ming Ho (§,1,2)
1 Research Center for Information Technology Innovation
2 Institute of Information Science
Academia Sinica
Taipei, Taiwan, ROC
E-mail: {wcchung, yjchang, rocky, dtlee, hoho}@iis.sinica.edu.tw
Abstract—The MapReduce framework has become the de facto
choice for big data analysis in a variety of applications. In
the MapReduce programming model, computation is distributed
to a cluster of computing nodes that run in parallel. The
performance of a MapReduce application is thus affected by
system and middleware, characteristics of data, and design and
implementation of the algorithms. In this study, we focus on
performance optimization of a MapReduce application, i.e.,
CloudRS, which tackles the problem of detecting and
removing errors in next-generation sequencing de novo
genomic data. We present three strategies of
communication between the Map() and Reduce() functions:
content-exchange, content-grouping, and index-only.
The three strategies differ in the way messages are exchanged
between the two functions. We also present experimental
results that compare the performance of the three strategies.
Keywords-error correction; genome assembly; MapReduce;
next-generation sequencing; optimization
I. INTRODUCTION
MapReduce [1] is a prominent distributed computational
framework that possesses various key features for dealing
with large-scale data processing on the cloud [2-4], including
fault tolerance, scheduling, data replication, load balancing,
and parallelization. By virtue of its scalability and ease
of development, MapReduce and its implementations [5-7]
have been widely used in different applications, e.g., Web
and social network analysis, scientific emulation, financial
and business data processing, and bioinformatics [8-12].
However, the performance and efficiency of MapReduce are
affected by many factors, which makes optimization
challenging.
Optimizing MapReduce is essential as processing data in
a timely and cost-efficient manner becomes critical [13-18].
Fortunately, various techniques have been introduced to
improve the performance of MapReduce [19-25], including
hardware-, software-, and framework-level optimization. One
such technique is tuning the parameters of the
system, middleware, and MapReduce execution by using
expert systems [20-22] or rule-of-thumb policies [26, 27].
Another type of optimization focuses on the design of the
algorithm or the characteristics of the application's data [28,
29].
In this study, we focus on CloudRS [9], a MapReduce
application for correcting errors in next-generation
sequencing (NGS) data. As the cost of DNA sequencing
falls rapidly [12], the accompanying growth of genome
data results in unpredictable execution time, even if the data
is processed by MapReduce. Thus, to optimize the
performance of CloudRS, we evaluate three
message generation and transmission strategies that aim to reduce
the communication cost of MapReduce: content-exchange,
content-grouping, and index-only. We also present
the experimental results and discuss the observations and
limitations of our proposed strategies.
II. BACKGROUND
A. The MapReduce programming model
The MapReduce programming model is composed of
two primitive functions, Map and Reduce. The input data of
a MapReduce program is a list of <key, value> pairs. The
Map() function is applied to each pair and generates
a set of intermediate pairs, e.g., <key, list(value)>. Then the
Reduce() function is applied to each intermediate pair,
processes the values of the list, and produces aggregated final
results. Moreover, there are additional functions in the
MapReduce execution model, e.g., shuffle and sort, to handle
intermediate data. The shuffle function is applied on the Map
side, and performs data exchange by key after Map(). Thus,
data with the same key will be transmitted to a single
Reduce() function. The sort function is launched on the
Reduce side after data exchange. It sorts data by the key field
to group all the pairs with the same key for further
processing.
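
The Map, shuffle, sort, and Reduce steps described above can be illustrated with a minimal single-process sketch. It is not taken from CloudRS or from any MapReduce implementation: the k-mer counting task, the names map_fn, shuffle, sort_and_group, reduce_fn, and run_job, the value K = 3, and the CRC32-based partitioning are all assumptions chosen only to make the data flow concrete.

```python
import zlib
from itertools import groupby
from operator import itemgetter

K = 3  # k-mer length; an arbitrary value chosen for illustration


def map_fn(read_id, sequence):
    """Map(): emit one intermediate <k-mer, 1> pair per k-mer in a read."""
    for i in range(len(sequence) - K + 1):
        yield sequence[i:i + K], 1


def shuffle(pairs, num_reducers):
    """Shuffle: route each pair to a partition chosen by its key, so that
    all pairs sharing a key reach the same Reduce() task."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        # Hypothetical partitioner: a deterministic hash of the key.
        index = zlib.crc32(key.encode("ascii")) % num_reducers
        partitions[index].append((key, value))
    return partitions


def sort_and_group(partition):
    """Sort: order pairs by key and group them into <key, list(value)>."""
    ordered = sorted(partition, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_fn(key, values):
    """Reduce(): aggregate the list of values that belong to one key."""
    return key, sum(values)


def run_job(records, num_reducers=2):
    """Run Map, shuffle, sort, and Reduce in sequence within one process."""
    intermediate = [pair for rid, seq in records for pair in map_fn(rid, seq)]
    results = {}
    for partition in shuffle(intermediate, num_reducers):
        for key, values in sort_and_group(partition):
            out_key, total = reduce_fn(key, values)
            results[out_key] = total
    return results


if __name__ == "__main__":
    reads = [("r1", "ACGTA"), ("r2", "CGTAC")]
    print(run_job(reads))
```

For the two example reads, the 3-mers CGT and GTA occur in both reads and so are counted twice, while ACG and TAC are counted once. In a real cluster each partition would be processed by a separate Reduce task on a different node, and the shuffle step would move data over the network; that transfer is the communication cost the paper sets out to reduce.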
B. The CloudRS algorithm
The CloudRS algorithm [9] is implemented with multiple
MapReduce rounds. It aims to correct sequence errors
conservatively, avoiding false decisions, and thus to
improve the quality of de novo assembly. To correct a
possible mismatch, CloudRS emulates read alignment and
majority voting for each set of reads, denoted as a read stack,
───────────────────
# These authors contributed equally to this work (co-first authors).
3 Department of Computer Science and Information Engineering, National
Taiwan University, Taipei, Taiwan, ROC.
4 Department of Computer Science and Information Engineering, National
Chung Hsing University, Taichung, Taiwan, ROC.
§ Corresponding author.
2013 IEEE International Conference on Big Data
978-1-4799-1293-3/13/$31.00 ©2013 IEEE