Extension of a Task-based model to Functional programming

Lucas M. Ponce*, Daniele Lezzi†, Rosa M. Badia†‡ and Dorgival Guedes*

* Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil
Email: {lucasmsp, dorgival}@dcc.ufmg.br
† Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS), Barcelona, Spain
‡ Spanish National Research Council (CSIC), Barcelona, Spain
Email: {daniele.lezzi, rosa.m.badia}@bsc.es
Abstract—Recently, efforts have been made to bring together
the areas of high-performance computing (HPC) and massive
data processing (Big Data). Traditional HPC frameworks, like
COMPSs, are mostly task-based, while popular big-data envi-
ronments, like Spark, are based on functional programming
principles. The former are known for their good performance on regular, matrix-based computations, while the latter have often been considered more successful for fine-grained, data-parallel workloads. In this paper we present our
experience with the integration of some dataflow techniques
into COMPSs, a task-based framework, in an effort to bring
together the best aspects of both worlds. We present our API,
called DDF, which provides a new data abstraction that addresses
the challenges of integrating Big Data application scenarios into
COMPSs. DDF has a functional-based interface, similar to many
Data Science tools, that allows us to use dynamic evaluation to
adapt the task execution at runtime. Besides the performance
optimization it provides, the API facilitates the development
of applications by experts in the application domain. In this
paper we evaluate DDF’s effectiveness by comparing the resulting
programs to their original versions in COMPSs and Spark. The
results show that DDF can improve COMPSs execution time and
even outperform Spark in many use cases.
Index Terms—COMPSs, Big Data, Performance Evaluation,
DataFlow Programming
I. INTRODUCTION
Convergence between high-performance computing (HPC)
and Big Data has become an important research area, driven in
part by the need to incorporate high-level libraries, platforms,
and algorithms for machine learning and graph processing,
and in part by the idea of using Big Data’s fine-grained data
awareness to increase the productivity of HPC systems [1], [2].
Several proposals of higher-level abstractions have emerged
to address the requirements of these two areas in com-
puter systems [3], [4]. Recent frameworks, like COMPSs [3],
Twister2 [4], Spark [5] and Flink [6], share a common dataflow
programming model, but are still focused on a single area.
Dataflow is a special case of task-based models where an
application can be represented as a directed acyclic graph
(DAG), with nodes representing computational steps and edges
indicating communication between nodes. The computation at
a node is activated when its inputs (events, task data) become
available. A well-designed dataflow framework hides low-
level operational details, such as communication, concurrency
control, and disk I/O, from the users developing parallel
applications, allowing them to focus on the application itself.
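The activation rule described above can be made concrete with a toy executor in plain Python (names and structure are illustrative only, not tied to COMPSs or any other framework): each node fires as soon as all of its inputs have values.

```python
# Toy dataflow executor: a node runs as soon as all of its inputs exist.
# The function and variable names here are illustrative, not any framework's API.

def run_dataflow(nodes, edges, sources):
    """nodes: {name: function}, edges: {name: [input node names]},
    sources: {name: initial value} for nodes with no predecessors."""
    values = dict(sources)
    pending = [n for n in nodes if n not in values]
    while pending:
        for name in list(pending):
            deps = edges[name]
            if all(d in values for d in deps):          # all inputs available?
                values[name] = nodes[name](*[values[d] for d in deps])
                pending.remove(name)                     # node has fired
    return values

# A small DAG:  a --> square --+--> add
#               b -------------+
result = run_dataflow(
    nodes={"square": lambda x: x * x, "add": lambda x, y: x + y},
    edges={"square": ["a"], "add": ["square", "b"]},
    sources={"a": 3, "b": 1},
)
```

The edge structure alone determines execution order: `square` runs first because `add` depends on its output, with no explicit scheduling in the user code.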
While sharing that common model, each framework has
its own abstraction and run time system, which are generally
related to its original environment. Traditionally, HPC environ-
ments provide an interface through User-Defined Functions,
which gives freedom for users to write their applications by
defining its tasks. For instance, in COMPSs, an MPI-based
framework commonly used in HPC scenarios, applications are
written following the sequential paradigm with the addition of
annotations in the code that indicate that a given
method is a task and declare its inputs and outputs. Such
frameworks are commonly used in scientific algorithms such
as matrix computations. Despite the good performance in those
scenarios, it is often hard to implement optimized applications
that handle irregular data and complex data flows, such as
those commonly found in machine learning and data mining
areas.
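A minimal sketch of that annotation style in plain Python (the `task` decorator below is a stand-in written for this example, not the actual COMPSs/PyCOMPSs API): the main program remains sequential, and the decorator marks which methods are tasks.

```python
# Minimal mimic of annotation-based task declaration: the decorator
# registers each invocation as a "task" instead of changing the sequential
# code. This illustrates the style only; it is not the real COMPSs API.

submitted = []  # stand-in for the runtime's task queue

def task(func):
    def wrapper(*args):
        submitted.append((func.__name__, args))  # register the task
        return func(*args)                       # here: execute eagerly
    return wrapper

@task
def multiply(a, b):
    return a * b

@task
def add(a, b):
    return a + b

# The main program stays sequential; the annotations mark the tasks,
# and a real runtime would infer the dependency add <- multiply.
total = add(multiply(2, 3), 4)
```

In an actual task-based runtime, the wrapper would return a future and the calls would execute asynchronously once their dependencies resolve; the eager version above only shows how annotations leave the sequential program untouched.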
The process of transmitting large volumes of input data
to tasks has a high cost in many HPC systems, especially
when that data is the output of a previous task, as is
the case in COMPSs, because it involves data serialization
and deserialization steps. Because of that, a common practice
adopted by advanced programmers is to minimize the number
of different tasks by joining the code of multiple functions in
a single task. This is a challenge when black-box libraries of
parallel algorithms are used, because, depending on the flow of
operations, it might be necessary to join the code of different
functions in order to obtain better performance, but that code
would not be available.
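A minimal sketch of that practice, using hypothetical functions in plain Python: the two logical steps are fused into a single task body, so no intermediate result crosses a serialization boundary between tasks.

```python
# Two logical steps written as separate tasks: in a task-based runtime,
# the intermediate list produced by `normalize` would be serialized and
# deserialized between the two task executions.
def normalize(values):
    m = max(values)
    return [v / m for v in values]

def threshold(values, limit):
    return [v for v in values if v >= limit]

# Fused version: one task, no intermediate serialization boundary.
def normalize_and_threshold(values, limit):
    m = max(values)
    return [v / m for v in values if v / m >= limit]

data = [1.0, 2.0, 4.0, 8.0]
fused = normalize_and_threshold(data, 0.5)
separate = threshold(normalize(data), 0.5)
```

The fusion is only possible because both function bodies are visible to the programmer, which is exactly what a black-box library of parallel algorithms prevents.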
On the other hand, recent Big Data environments have
adopted functional languages [5], [6] as a form to express their
data abstractions and flows. Those frameworks implement a
set of common operations and basic algorithms to facilitate
the development of applications by experts in the applica-
tion domain. However, some research [7], [8] shows that,
depending on the application (e.g., matrix computations), those
frameworks achieve good scalability, but poor performance
when compared with an MPI implementation.
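The functional style those frameworks expose can be sketched with Python built-ins alone (no framework assumed): the data flow is written as a chain of small, composable operations rather than as annotated methods.

```python
from functools import reduce

# Functional-style pipeline in the spirit of map/filter/reduce chains
# found in Big Data APIs, expressed here with plain Python built-ins.
total = reduce(
    lambda a, b: a + b,                  # reduce: sum the survivors
    filter(lambda x: x % 2 == 0,         # filter: keep even values
           map(lambda x: x * x,          # map: square each element
               range(1, 7))))
```

Because each stage is a pure function over the data, a runtime can partition the input, apply the chain per partition, and combine the results, which is the source of the scalability such frameworks achieve.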
In that context, this paper discusses our experience in bring-
ing together the programming model of COMPSs and the
functional programming abstractions usually found in frame-
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for
resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
DOI 10.1109/SBAC-PAD.2019.00023