Please cite this article in press as: V. Dias, et al., Janus: Diagnostics and reconfiguration of data parallel programs, J. Parallel Distrib. Comput. (2018), https://doi.org/10.1016/j.jpdc.2018.02.030.

J. Parallel Distrib. Comput. (journal homepage: www.elsevier.com/locate/jpdc)

Janus: Diagnostics and reconfiguration of data parallel programs

Vinicius Dias *, Wagner Meira Jr., Dorgival Guedes
Department of Computer Science, Universidade Federal de Minas Gerais (UFMG), Brazil

highlights
• Performance diagnosis dimensions are proposed as an evaluation methodology.
• Applications representing common communication patterns are characterized.
• An extensible tool is proposed for reconfiguration of Spark applications.

article info
Article history: Received 25 April 2017; Received in revised form 14 November 2017; Accepted 26 February 2018; Available online xxxx
MSC: 68M14, 68M20
Keywords: Performance diagnosis; Dynamic reconfiguration; Spark

abstract
The increasing amount of data being stored and the variety of algorithms proposed to meet the processing demands of data scientists have led to a new generation of computational environments and paradigms. These environments simplify the task of programmers, but achieving ideal performance continues to be a challenge. In this work we investigate important factors concerning the performance of common big-data applications and consider the Spark framework as the target for our contributions. Based on that, we present the design and implementation of Janus, a tool that automates the reconfiguration of Spark applications. It leverages logs from previous executions as input, enforces configurable adjustment policies over the collected statistics, and makes its decisions taking into account communication behaviors specific to the application being evaluated.
To accomplish that, Janus identifies global parameters that should be updated, or points in the user program where the data partitioning can be adjusted, based on those policies. Our results show gains of up to 1.9× in the scenarios considered.

© 2018 Elsevier Inc. All rights reserved.

* Corresponding author. E-mail address: viniciusvdias@dcc.ufmg.br (V. Dias).

1. Introduction

The evolution of areas like data mining, machine learning, and data analytics, together with the increased availability of data sources, has led to the rise of Data Science as a new way of processing and extracting value from large amounts of data. One strategy that has become popular for processing such data is the use of data-parallel frameworks, like Hadoop and Spark. They provide data scientists, or domain experts, with tools that offer high-level abstractions to express complex data processing algorithms in a way that can be parallelized across a large number of machines, without requiring them to express or handle low-level parallelism tasks. Given its wide acceptance in current big-data scenarios, we consider Spark in our analysis.

Based on the algorithm description provided by the user in such high-level abstractions, it is the task of the programming environment to find the best configuration to maximize execution performance with good resource usage. That configuration is often derived from information about the data to be processed and the operations to be performed. However, in some cases the solutions proposed may not be the best ones.

Considering that data science models and algorithms are irregular and intensive in terms of both computation and communication, performance diagnosis of parallel applications in environments such as Spark is quite a challenge.
Finding the right partition for the data at each stage of execution, balancing load during execution, and adjusting the environment as application behavior changes are all difficult tasks for the execution framework to perform.

In this work we present Janus (named after the ancient Roman god who frequently symbolized change and transitions, such as the progress from past to future), a tool that can automate the reconfiguration of Spark applications so that the framework achieves better performance for each application. To do that, we first present the characterization of traditional data mining algorithms through three different massive data-parallel applications that we believe represent most of the algorithm patterns found in the area. Next we describe our tool for adaptive reconfiguration