Scheduling Cyclic Task Graphs with SCC-Map
Alexandre Sardinha∗, Tiago A. O. Alves∗, Leandro A. J. Marzulo†,
Felipe M. G. França∗, Valmir C. Barbosa∗ and Vítor Santos Costa‡
∗Universidade Federal do Rio de Janeiro
Programa de Engenharia de Sistemas e Computação, COPPE, Rio de Janeiro, RJ, Brasil
Email: {sardinha, tiagoaoa, felipe, valmir}@cos.ufrj.br
†Universidade do Estado do Rio de Janeiro
Instituto de Matemática e Estatística, Departamento de Informática e Ciência da Computação, Rio de Janeiro, RJ, Brasil
Email: leandro@ime.uerj.br
‡Universidade do Porto
Departamento de Ciência de Computadores, Porto, Portugal
Email: vsc@dcc.fc.up.pt
Abstract—The Dataflow execution model has been shown to
be a good way of exploiting Thread-Level Parallelism (TLP),
making parallel programming easier. In this model, tasks must
be mapped to processing elements (PEs) considering the trade-
off between communication and parallelism. Previous work on
scheduling dependency graphs has mostly focused on directed
acyclic graphs, which are not suitable for dataflow (loops in
the code become cycles in the graph). Thus, we present the
SCC-Map: a novel static mapping algorithm that considers the
importance of cycles during the mapping process. To validate
our approach, we ran a set of benchmarks using our dataflow
simulator varying the communication latency, the number of PEs
in the system and the placement algorithm. Our results show that
the benchmark programs run significantly faster when mapped
with SCC-Map. Moreover, we observed that SCC-Map is more
effective than the other mapping algorithms when communication
latency is higher.
I. INTRODUCTION
Recent work has pointed to the Dataflow execution model as
a good alternative for exploiting thread-level parallelism [1]–[3].
In the Dataflow model, programs can be described as a graph,
where nodes represent instructions (or tasks) and edges in
the graph describe their dependencies. Execution is guided by
the dataflow firing rule: instructions are fired as soon as all of
their input operands are ready (i.e., all of their parents have
completed).
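The firing rule can be illustrated with a minimal sketch. It assumes an acyclic example graph, unit execution time, and unlimited PEs; the function name `fire_order` and the example graph are hypothetical and are not taken from TALM or from the simulator used in this work.

```python
from collections import defaultdict

def fire_order(parents):
    """Group instructions by the step at which the firing rule enables them.

    `parents` maps each instruction to the list of instructions it depends on.
    Every instruction, including those without parents, must appear as a key.
    """
    children = defaultdict(list)
    pending = {}  # number of input operands still missing, per instruction
    for node, deps in parents.items():
        pending[node] = len(deps)
        for dep in deps:
            children[dep].append(node)

    # Instructions with no inputs are ready immediately.
    ready = sorted(n for n, count in pending.items() if count == 0)
    steps = []
    while ready:
        steps.append(ready)
        next_ready = []
        for node in ready:
            for child in children[node]:
                pending[child] -= 1
                if pending[child] == 0:  # all operands have arrived
                    next_ready.append(child)
        ready = sorted(next_ready)
    return steps

# Hypothetical five-instruction graph: c needs a and b, d needs c, e needs a.
example = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": ["a"]}
steps = fire_order(example)
print(steps)
```

Each inner list holds the instructions that become ready at the same step, which is the parallelism a mapping can try to exploit.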
A problem that arises from this strategy is the need to
map instructions to processing elements (PEs). Once the
program dependencies have been described in a dataflow graph,
one must decide where each instruction will be placed (i.e., which
available PE will execute which instruction). A good mapping
must balance two opposing effects: the more instructions are spread
among PEs, the more parallelism is available, but also the
more communication overhead is incurred. Therefore, an
ideal scheduling strategy must aim at a good trade-off
between communication and parallelism. Throughout this
work, we use the terms “map” and “schedule” interchangeably.
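The tension between communication and parallelism can be made concrete with two toy metrics. This is only an illustrative sketch: the helper names `cross_pe_edges` and `max_load`, and the example graph, are assumptions introduced here and are not the cost model used by SCC-Map or by the reference algorithms.

```python
from collections import Counter

def cross_pe_edges(edges, mapping):
    """Count dependencies whose endpoints sit on different PEs.

    Each such edge implies an operand sent across the interconnect.
    """
    return sum(1 for src, dst in edges if mapping[src] != mapping[dst])

def max_load(mapping):
    """Largest number of instructions assigned to a single PE."""
    return max(Counter(mapping.values()).values())

# Hypothetical graph with five instructions and four dependencies.
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "e")]

serial = {n: 0 for n in "abcde"}                 # everything on PE 0
spread = {n: i for i, n in enumerate("abcde")}   # one instruction per PE

print(cross_pe_edges(edges, serial), max_load(serial))
print(cross_pe_edges(edges, spread), max_load(spread))
```

The serial mapping has no inter-PE communication but no parallelism either, while the fully spread mapping has minimal per-PE load and pays for every edge. Practical mappings fall between these two extremes.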
Previous work on scheduling dependency graphs has
mostly focused on DAGs (directed acyclic graphs), with good
results being achieved for static (or offline) mapping [4]–
[8]. However, mapping algorithms for DAGs are not suitable
for dataflow graphs, since dataflow programs often contain
cycles corresponding to the loops in the code. We present the
SCC-Map: a new static mapping algorithm for dependency
graphs that contain cycles (namely, for dataflow graphs). Our
proposal is based on the work of Boyer and Hura [6], which
was aimed at DAGs. We further compare our novel approach
towards dataflow graphs with other algorithms, such as the
ones presented in [6], [7].
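One way to see why cycles are a problem is that many DAG-oriented techniques rely on a topological order of the graph, which does not exist once a loop introduces a back edge. The sketch below uses Kahn's algorithm, a standard technique chosen here for illustration only; it is not part of SCC-Map or of the algorithms in [6], [7], and the instruction names are hypothetical.

```python
from collections import defaultdict, deque

def topological_order(edges):
    """Kahn's algorithm. Returns a topological order, or None if a cycle exists."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
        nodes.update((src, dst))

    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in succ[node]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                queue.append(nxt)
    # Nodes on a cycle never reach in-degree zero, so they are never emitted.
    return order if len(order) == len(nodes) else None

# Straight-line code: init -> add -> out.
straight_line = [("init", "add"), ("add", "out")]
# A loop: the comparison feeds the adder again, creating the cycle add -> cmp -> add.
loop = [("init", "add"), ("add", "cmp"), ("cmp", "add"), ("cmp", "out")]

print(topological_order(straight_line))
print(topological_order(loop))
```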
In order to validate our ideas, we developed a cycle-by-
cycle dataflow simulator, allowing us to investigate in detail
the effects of each mapping strategy. Then, we compiled a set
of benchmarks to run on the simulator varying the placement
algorithm, communication latency, and the number of available
processing elements in the system. Mappings were obtained
by a set of reference algorithms and SCC-Map. We compared
the speedups in all scenarios, taking as baseline the serial
execution, i.e., the case where all instructions are mapped to
the same PE. Moreover, we provide a theoretical maximum
speedup that could be achieved for each application. This value
was obtained by placing each instruction of the program in a
distinct PE and setting the communication latency to 1 clock
cycle (minimum possible latency).
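The two reference points described above can be written compactly. The symbols are notation assumed here for illustration and do not come from the paper: $T_{\text{serial}}$ is the cycle count with all instructions on one PE, $T_M(p,\ell)$ is the cycle count under mapping $M$ with $p$ PEs and communication latency $\ell$, and $T_{\text{ideal}}$ is the cycle count with one instruction per PE and a latency of 1 cycle.

```latex
S_M(p,\ell) = \frac{T_{\text{serial}}}{T_M(p,\ell)},
\qquad
S_{\max} = \frac{T_{\text{serial}}}{T_{\text{ideal}}}
```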
Our results show that, for most of the tested scenarios, the
programs in our set run significantly faster with the
instruction-to-PE mappings obtained with our algorithm. Moreover, we
observed that SCC-Map is more effective than the other
mapping algorithms when communication latency is higher
in the system.
The rest of this paper is organized as follows: Section II
discusses the relevance of the mapping problem in the context
of dataflow systems and presents TALM (the dataflow model
used as the basis for the simulator with which we test SCC-Map);
Section III discusses related work; Section IV presents
SCC-Map (our mapping algorithm); results are presented and
discussed in Section V; we conclude and indicate possible
future work in Section VI.
2012 Third Workshop on Applications for Multi-Core Architecture
978-0-7695-4916-3/12 $26.00 © 2012 IEEE
DOI 10.1109/WAMCA.2012.8