Extension of a Task-based Model to Functional Programming

Lucas M. Ponce*, Daniele Lezzi†, Rosa M. Badia†‡ and Dorgival Guedes*
*Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil
Email: {lucasmsp, dorgival}@dcc.ufmg.br
†Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS), Barcelona, Spain
‡Spanish National Research Council (CSIC), Barcelona, Spain
Email: {daniele.lezzi, rosa.m.badia}@bsc.es

Abstract—Recently, efforts have been made to bring together the areas of high-performance computing (HPC) and massive data processing (Big Data). Traditional HPC frameworks, like COMPSs, are mostly task-based, while popular big-data environments, like Spark, are built on functional programming principles. The former are known for their good performance on regular, matrix-based computations; the latter, on the other hand, have often been considered more successful for fine-grained, data-parallel workloads. In this paper we present our experience with the integration of some dataflow techniques into COMPSs, a task-based framework, in an effort to bring together the best aspects of both worlds. We present our API, called DDF, which provides a new data abstraction that addresses the challenges of integrating Big Data application scenarios into COMPSs. DDF has a functional-based interface, similar to many Data Science tools, that allows us to use dynamic evaluation to adapt task execution at runtime. Besides the performance optimization it provides, the API facilitates the development of applications by experts in the application domain. In this paper we evaluate DDF's effectiveness by comparing the resulting programs to their original versions in COMPSs and Spark. The results show that DDF can improve COMPSs execution time and even outperform Spark in many use cases.

Index Terms—COMPSs, Big Data, Performance Evaluation, DataFlow Programming

I. INTRODUCTION

Convergence between high-performance computing (HPC) and Big Data has become an important research area, driven in part by the need to incorporate high-level libraries, platforms, and algorithms for machine learning and graph processing, and in part by the idea of using Big Data's fine-grained data awareness to increase the productivity of HPC systems [1], [2]. Several proposals of higher-level abstractions have emerged to address the requirements of these two areas in computer systems [3], [4]. Recent frameworks, like COMPSs [3], Twister2 [4], Spark [5] and Flink [6], share a common dataflow programming model, but are still focused on a single area.

Dataflow is a special case of the task-based model in which an application is represented as a directed acyclic graph (DAG), with nodes representing computational steps and edges indicating communication between nodes. The computation at a node is activated when its inputs (events, task data) become available. A well-designed dataflow framework hides low-level operational details, such as communication, concurrency control, and disk I/O, from the users developing parallel applications, allowing them to focus on the application itself.

While sharing that common model, each framework has its own abstraction and runtime system, which are generally related to its original environment. Traditionally, HPC environments provide an interface through User-Defined Functions, which gives users the freedom to write their applications by defining their tasks. For instance, in COMPSs, an MPI-based framework commonly used in HPC scenarios, applications are written following the sequential paradigm, with annotations in the code that identify a given method as a task and declare its inputs and outputs. Such frameworks are commonly used for scientific algorithms such as matrix computations.
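The annotation idea described above can be sketched as follows. This is a minimal plain-Python stand-in for a PyCOMPSs-style @task decorator, not actual COMPSs code: it performs no distribution, and the registry and parameter-direction keywords are hypothetical. It only illustrates how annotations let a sequential program declare that a method is a task and what its inputs and outputs are.

```python
# Sketch of annotation-based task declaration (NOT the real PyCOMPSs API).
# The decorator records task metadata the way a runtime would use it; the
# REGISTERED_TASKS registry is a hypothetical stand-in for the runtime.
import functools

REGISTERED_TASKS = {}

def task(returns=0, **param_directions):
    """Mark a function as a task; param_directions maps names to 'IN'/'OUT'."""
    def decorator(func):
        REGISTERED_TASKS[func.__name__] = {
            "returns": returns,
            "params": param_directions,
        }
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # A real runtime would schedule this call asynchronously on a
            # worker; here we simply execute it inline.
            return func(*args, **kwargs)
        return wrapper
    return decorator

@task(returns=1, block_a="IN", block_b="IN")
def multiply(block_a, block_b):
    """One block of a blocked matrix product, written sequentially."""
    n = len(block_a)
    return [[sum(block_a[i][k] * block_b[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

# The application code stays sequential; the annotation tells the runtime
# that 'multiply' is a task with two inputs and one returned output.
result = multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The point of the sketch is that the call site is ordinary sequential code; all parallelism-related information lives in the annotation.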
Despite the good performance in those scenarios, it is often hard to implement optimized applications that handle irregular data and complex data flows, such as those commonly found in machine learning and data mining. Transmitting large volumes of input data to tasks has a high cost in many HPC systems, especially when that data is the output of a previous task, as is the case in COMPSs, because it involves data serialization and deserialization steps. Because of that, a common practice adopted by advanced programmers is to minimize the number of different tasks by joining the code of multiple functions into a single task. This becomes a challenge when black-box libraries of parallel algorithms are used: depending on the flow of operations, it might be necessary to join the code of different functions to obtain better performance, but that code would not be available.

On the other hand, recent Big Data environments have adopted functional languages [5], [6] as a form to express their data abstractions and flows. Those frameworks implement a set of common operations and basic algorithms to facilitate the development of applications by experts in the application domain. However, some research [7], [8] shows that, depending on the application (e.g., matrix computations), those frameworks achieve good scalability but poor performance when compared with an MPI implementation.

In that context, our work discusses our experience in bringing together the programming model of COMPSs and the functional programming abstractions usually found in frameworks

© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
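The task-fusion practice described above can be illustrated with a toy example. This is plain Python, not COMPSs code: the two function names are hypothetical, and `pickle` stands in for the serialization and deserialization steps that a runtime performs at each task boundary. Fusing two functions into one task pays the boundary cost once instead of twice.

```python
# Toy illustration of task fusion (NOT COMPSs code). pickle round-trips
# stand in for the runtime's per-task serialization/deserialization.
import pickle

def normalize(values):
    """First 'task': scale values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def threshold(values, cut=0.2):
    """Second 'task': keep only values at or above the cutoff."""
    return [v for v in values if v >= cut]

def run_as_two_tasks(data):
    # Two tasks: the intermediate result crosses a task boundary, so it is
    # serialized and deserialized between 'normalize' and 'threshold'.
    data = pickle.loads(pickle.dumps(data))            # input boundary
    out1 = pickle.loads(pickle.dumps(normalize(data))) # inter-task boundary
    return pickle.loads(pickle.dumps(threshold(out1))) # output boundary

def run_as_fused_task(data):
    # One fused task: boundaries only at the edges, none in between.
    data = pickle.loads(pickle.dumps(data))            # input boundary
    return pickle.loads(pickle.dumps(threshold(normalize(data))))
```

Both versions compute the same result; the fused one simply crosses fewer task boundaries. The difficulty noted above is that this fusion requires the source of both functions, which a black-box library does not expose.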
DOI 10.1109/SBAC-PAD.2019.00023