Please cite this article in press as: V. Dias, et al., Janus: Diagnostics and reconfiguration of data parallel programs, J. Parallel Distrib. Comput. (2018),
https://doi.org/10.1016/j.jpdc.2018.02.030.
J. Parallel Distrib. Comput.
journal homepage: www.elsevier.com/locate/jpdc
Janus: Diagnostics and reconfiguration of data parallel programs
Vinicius Dias *, Wagner Meira Jr., Dorgival Guedes
Department of Computer Science, Universidade Federal de Minas Gerais (UFMG), Brazil
highlights
• Performance diagnosis dimensions are proposed as an evaluation methodology.
• Applications representing common communication patterns are characterized.
• An extensible tool is proposed for reconfiguration of Spark applications.
article info
Article history:
Received 25 April 2017
Received in revised form 14 November 2017
Accepted 26 February 2018
Available online xxxx
MSC:
68M14
68M20
Keywords:
Performance diagnosis
Dynamic reconfiguration
Spark
abstract
The increasing amount of data being stored and the variety of algorithms proposed to meet the
processing demands of data scientists have led to a new generation of computational environments
and paradigms. These environments simplify the task of programmers, but achieving ideal
performance remains a challenge. In this work we investigate important factors affecting the
performance of common big-data applications, taking the Spark framework as the target for our
contributions. Based on that investigation, we present the design and implementation of Janus, a
tool that automates the reconfiguration of Spark applications. It takes logs from previous
executions as input, enforces configurable adjustment policies over the collected statistics, and
makes its decisions taking into account communication behaviors specific to the application
evaluated. To accomplish that, Janus identifies global parameters that should be updated, or
points in the user program where the data partitioning can be adjusted based on those policies.
Our results show gains of up to 1.9× in the scenarios considered.
© 2018 Elsevier Inc. All rights reserved.
1. Introduction
The evolution of areas like data mining, machine learning, and
data analytics, together with the increased availability of data
sources, has led to the rise of Data Science as a new way of
processing and extracting value from large amounts of data. One
strategy that has become popular for processing such data is the use
of data-parallel frameworks, like Hadoop and Spark. They provide data
scientists, or domain experts, with tools that offer high-level
abstractions to express complex data processing algorithms in a way
that can be parallelized across a large number of machines, without
requiring them to express or handle low-level parallelism tasks. Given
its wide acceptance in current big-data scenarios, in this work we
consider Spark in our analysis.
Based on the algorithm description provided by the user in
such high-level abstractions, it is the task of the programming
environment to find the best configuration to maximize execution
performance with good resource usage. That configuration is often
derived from information about the data to be processed and the
operations to be performed. However, in some cases the solutions
proposed may not be the best ones.
* Corresponding author.
E-mail address: viniciusvdias@dcc.ufmg.br (V. Dias).
Considering that data science models and algorithms are
irregular and intensive in terms of both computation and
communication, performance diagnosis of parallel applications in
environments such as Spark is quite a challenge. Finding the right
partition for the data at each stage of execution, balancing load
during execution and adjusting the environment as application
behavior changes are all difficult tasks to be performed by the
execution framework.
In this work we present Janus,¹ a tool that can automate the
reconfiguration of Spark applications to allow the framework to
achieve better performance for each application. To do that, first
we present the characterization of traditional data mining
algorithms through three different massive data-parallel applications
that we believe represent most of the algorithm patterns found in
the area. Next we describe our tool for adaptive reconfiguration
¹ Janus is an ancient Roman god who frequently symbolized change and
transitions, such as the progression from past to future.
0743-7315/© 2018 Elsevier Inc. All rights reserved.