Programming Multi-core Architectures Using Data-Flow Techniques

Samer Arandi
Department of Computer Science, University of Cyprus, Nicosia, Cyprus
Email: samer@cs.ucy.ac.cy

Paraskevas Evripidou
Department of Computer Science, University of Cyprus, Nicosia, Cyprus
Email: skevos@cs.ucy.ac.cy

Abstract—In this paper we present a multithreaded programming methodology for multi-core systems that utilizes Data-Flow concurrency. The programmer augments the program with macros that define threads and their data dependencies. The macros are expanded into calls to the run-time, which creates and maintains the dependency graph of the threads and schedules the threads using Data-Flow principles. We demonstrate the programming methodology and discuss some of the issues and optimizations affecting performance. A detailed evaluation is presented using two applications as case studies. The evaluation shows that the two applications scale well and compare favorably with the results of similar systems. Our results demonstrate that Data-Flow concurrency can be efficiently implemented as a Virtual Machine on multi-core systems.

I. INTRODUCTION

Over the last five decades, high performance was achieved by relying on improvements in fabrication technologies and architectural/organizational optimizations. However, the most severe limitation of the sequential model, namely its inability to tolerate long latencies, has slowed down the performance gains, forcing the industry to hit the Memory Wall. As a result of this and other factors, such as the Power Wall and the diminishing returns of Instruction-Level Parallelism, the entire industry had to switch to multiple cores per chip. This ushered in the Concurrency Era, as it soon became evident that traditional programming models did not allow for efficient utilization of the large number of resources now available on a single chip.
The Data-Flow model [1], [2], [3] is an alternative model that handles concurrency and tolerates memory and synchronization latencies efficiently. Moreover, the side-effect-free semantics of Data-Flow expose the maximum potential parallelism in programs by enforcing the minimum amount of ordering on execution (i.e., only true data dependencies are enforced).

In this work we present a programming methodology based on the Data-Driven Multithreading (DDM) [4], [5], [6] model of execution, which combines dynamic Data-Flow concurrency with efficient sequential execution. The programming methodology targets the Data-Driven Multithreading Virtual Machine (DDM-VM). C programs are augmented with a set of macros that define:

- Thread boundaries
- Producer-consumer relationships amongst the threads
- The data produced and consumed by each thread

The macros expand to calls to the runtime of the DDM-VM, which manages the execution according to the DDM model. The macros represent the low-level programming abstraction of the DDM-VM. While the programmer currently adds the macros to the code by hand, two compiler projects under development will automate this task. The first is based on Concurrent Collections (CnC) [7], a platform-independent, high-level parallel language, with the help of a source-to-source compiler. The second is a GCC-based auto-parallelizing compiler for the C language.

This paper presents a macro-based approach for programming multi-core architectures using Data-Flow techniques. The approach is demonstrated in detail and some of the factors and optimizations affecting performance are discussed. An in-depth evaluation of two case-study applications highlights the effect of these factors on performance and shows that both applications scale well and outperform similar systems.

1 This work was supported by the Cyprus Research Promotion Foundation under Grants ΔΠ/0505/25E & ΠENEK/ENIΣX/0308/44.
II. DATA-DRIVEN MULTITHREADING VIRTUAL MACHINE (DDM-VM)

Data-Driven Multithreading (DDM) is an execution model that combines the benefits of the Data-Flow model in exploiting concurrency with the efficient execution of the control-flow model. DDM decouples the execution part of a program from its synchronization part and allows them to execute asynchronously, thus tolerating synchronization and communication latencies efficiently. The core of the DDM model is the Thread Scheduling Unit (TSU) [8], which is responsible for scheduling threads at run-time based on data availability. DDM utilizes data-driven caching policies [9] to implement deterministic data prefetching, which improves locality.

The Data-Driven Multithreading Virtual Machine (DDM-VM) is a virtual machine that supports DDM execution on