Scheduling Cyclic Task Graphs with SCC-Map

Alexandre Sardinha, Tiago A. O. Alves, Leandro A. J. Marzulo, Felipe M. G. França, Valmir C. Barbosa and Vítor Santos Costa

Universidade Federal do Rio de Janeiro, Programa de Engenharia de Sistemas e Computação, COPPE, Rio de Janeiro, RJ, Brasil. Email: {sardinha, tiagoaoa, felipe, valmir}@cos.ufrj.br

Universidade do Estado do Rio de Janeiro, Instituto de Matemática e Estatística, Departamento de Informática e Ciência da Computação, Rio de Janeiro, RJ, Brasil. Email: leandro@ime.uerj.br

Universidade do Porto, Departamento de Ciência de Computadores, Porto, Portugal. Email: vsc@dcc.fc.up.pt

Abstract—The Dataflow execution model has been shown to be a good way of exploiting Thread-Level Parallelism (TLP), making parallel programming easier. In this model, tasks must be mapped to processing elements (PEs) considering the trade-off between communication and parallelism. Previous work on scheduling dependency graphs has mostly focused on directed acyclic graphs, which are not suitable for dataflow (loops in the code become cycles in the graph). We therefore present SCC-Map, a novel static mapping algorithm that considers the importance of cycles during the mapping process. To validate our approach, we ran a set of benchmarks on our dataflow simulator, varying the communication latency, the number of PEs in the system, and the placement algorithm. Our results show that the benchmark programs run significantly faster when mapped with SCC-Map. Moreover, we observed that SCC-Map is more effective than the other mapping algorithms when communication latency is higher.

I. INTRODUCTION

Recent work has pointed to the Dataflow execution model as a good alternative for exploiting thread-level parallelism [1]–[3]. In the Dataflow model, programs can be described as a graph, where nodes represent instructions (or tasks) and edges describe their dependencies.
Execution is guided by the dataflow firing rule: instructions are fired as soon as all of their input operands are ready (i.e., all of their parents have completed). A problem that arises from this strategy is the need to map instructions to processing elements (PEs). Once the program dependencies have been described in a dataflow graph, one must decide where each instruction will be placed (i.e., which available PE will execute which instruction). A good mapping must balance two opposing effects: the more the instructions are spread among PEs, the more parallelism is available, but also the more communication overhead is incurred. Therefore, an ideal scheduling strategy must aim at a good trade-off between communication and parallelism. Throughout this work, we use the terms “map” and “schedule” interchangeably.

Previous work on scheduling dependency graphs has mostly focused on DAGs (directed acyclic graphs), with good results being achieved for static (or offline) mapping [4]–[8]. However, mapping algorithms for DAGs are not suitable for dataflow graphs, since dataflow programs often contain cycles corresponding to the loops in the code.

We present SCC-Map: a new static mapping algorithm for dependency graphs that contain cycles (namely, for dataflow graphs). Our proposal is based on the work of Boyer and Hura [6], which was aimed at DAGs. We further compare our approach with other algorithms, such as the ones presented in [6], [7].

In order to validate our ideas, we developed a cycle-by-cycle dataflow simulator, allowing us to investigate in detail the effects of each mapping strategy. We then compiled a set of benchmarks to run on the simulator, varying the placement algorithm, the communication latency, and the number of available processing elements in the system. Mappings were obtained with a set of reference algorithms and with SCC-Map.
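To make the firing rule and the communication/parallelism trade-off concrete, the following is a minimal toy sketch of a cycle-by-cycle simulation. It is not the simulator used in this work: the function name `simulate`, the example graph, and the timing model are all assumptions made for illustration (one instruction per PE per cycle, every instruction takes one cycle, `latency` counts the extra cycles an operand needs to reach a different PE, and the graph is acyclic to keep the sketch short).

```python
def simulate(deps, mapping, latency):
    """Count the cycles needed to fire every instruction once (toy model).

    deps:    {instruction: [parent instructions]}; assumed acyclic here
    mapping: {instruction: PE id}
    latency: extra cycles for an operand to travel between different PEs
    """
    finish = {}            # instruction -> cycle in which it fired
    pending = set(deps)
    cycle = 0
    while pending:
        cycle += 1
        busy = set()       # PEs that have already fired in this cycle
        for inst in sorted(pending):
            pe = mapping[inst]
            if pe in busy:
                continue
            # Firing rule: every operand must already have arrived at this PE.
            ready = all(
                p in finish
                and finish[p] + (0 if mapping[p] == pe else latency) < cycle
                for p in deps[inst]
            )
            if ready:
                finish[inst] = cycle
                busy.add(pe)
        pending -= set(finish)
    return cycle


# Hypothetical 7-instruction graph: e = f(a, b), f = f(c, d), g = f(e, f).
deps = {"a": [], "b": [], "c": [], "d": [],
        "e": ["a", "b"], "f": ["c", "d"], "g": ["e", "f"]}

serial = {inst: 0 for inst in deps}                    # everything on one PE
spread = {"a": 0, "b": 0, "e": 0, "c": 1, "d": 1, "f": 1, "g": 1}

print(simulate(deps, serial, latency=10))   # 7 cycles
print(simulate(deps, spread, latency=1))    # 5 cycles
print(simulate(deps, spread, latency=10))   # 14 cycles
```

In this toy model, spreading the instructions over two PEs beats the single-PE mapping when the latency is low, but loses to it when the latency is high, because instruction `g` must wait for the operand produced on the other PE.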
We compared the speedups in all scenarios, taking as baseline the serial execution, i.e., the case where all instructions are mapped to the same PE. Moreover, we provide a theoretical maximum speedup that could be achieved for each application. This value was obtained by placing each instruction of the program on a distinct PE and setting the communication latency to 1 clock cycle (the minimum possible latency).

Our results show that, for most of the tested scenarios, our set of programs runs significantly faster with the instruction-to-PE mappings obtained with our algorithm. Moreover, we observed that SCC-Map is more effective than the other mapping algorithms when communication latency in the system is higher.

The rest of this paper is organized as follows: Section II discusses the relevance of the mapping problem in the context of dataflow systems and presents TALM (the dataflow model used as the basis for the simulator built to test SCC-Map); in Section III we discuss related work; Section IV presents SCC-Map (our mapping algorithm); results are presented and discussed in Section V; we conclude and indicate possible future work in Section VI.

2012 Third Workshop on Applications for Multi-Core Architecture. 978-0-7695-4916-3/12 $26.00 © 2012 IEEE. DOI 10.1109/WAMCA.2012.8