The Journal of Supercomputing
https://doi.org/10.1007/s11227-018-2578-0
Load balancing in join algorithms for skewed data in MapReduce systems
Elaheh Gavagsaz¹ · Ali Rezaee¹ · Hamid Haj Seyyed Javadi²
© Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract
Join is an essential tool for analyzing data collected from different data sources.
MapReduce has emerged as a prominent programming model for processing massive
data. However, traditional join algorithms based on MapReduce are not efficient
when handling skewed data. Skew in the input data leads to considerable
load imbalance and performance degradation. This paper proposes a new
skew-insensitive method, called fine-grained partitioning for skew data (FGSD), which
can improve load balancing across reduce tasks. The proposed method considers the
properties of both input and output data through a proposed stream sampling algorithm.
FGSD introduces a new approach to distributing input data that efficiently handles
both redistribution skew and join product skew. The experimental results
confirm that our solution not only achieves better load balancing but also
reduces job execution time under varying degrees of data skew. Furthermore,
FGSD does not require any modification to the MapReduce environment and is
applicable to complex joins.
Keywords Load balancing · Join algorithm · Data skew · MapReduce · Spark
✉ Ali Rezaee
AliRezaee@srbiau.ac.ir; alirezaee.uni@gmail.com
Elaheh Gavagsaz
egavagsaz@yahoo.com
Hamid Haj Seyyed Javadi
h.s.javadi@shahed.ac.ir
¹ Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
² Department of Applied Mathematics, Faculty of Mathematics and Computer Science, Shahed University, Tehran, Iran