Efficiency and Scalability of Barrier Synchronization on NoC Based Many-core Architectures

Oreste Villa
Pacific Northwest National Laboratory, High Performance Computing, 902 Battelle Boulevard, 99354 Richland (WA), USA
oreste.villa@pnl.gov

Gianluca Palermo
Politecnico di Milano, Dipartimento di Elettronica e Informazione, Via Ponzio 34/5, 20133 Milano, Italy
gpalermo@elet.polimi.it

Cristina Silvano
Politecnico di Milano, Dipartimento di Elettronica e Informazione, Via Ponzio 34/5, 20133 Milano, Italy
silvano@elet.polimi.it

ABSTRACT
Interconnects based on Networks-on-Chip are an appealing solution to address future microprocessor designs where, very likely, hundreds of cores will be connected on a single chip. A fundamental role in highly parallelized applications running on many-core architectures will be played by barrier primitives used to synchronize the execution of parallel processes. This paper focuses on the analysis of the efficiency and scalability of different barrier implementations in many-core architectures based on NoCs. Several message passing barrier implementations based on four algorithms (all-to-all, master-slave, butterfly and tree) have been implemented and evaluated for a single-chip target architecture composed of a variable number of cores (from 4 to 128) and different network topologies (mesh, torus, ring, clustered-ring and fat-tree). Using a cycle-accurate simulator, we show the scalability of each barrier for every NoC topology, analyzing and comparing theoretical with real behaviors. We observed that some barrier algorithms, when implemented in hardware or software, show a different scaling behavior with respect to those theoretically expected. We evaluate the efficiency of each combination topology-barrier, demonstrating that, in many cases, simple network topologies can be more efficient than complex and highly connected topologies.
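To make the notion of "theoretically expected" scaling concrete, the following is a minimal Python sketch, not taken from the implementations evaluated in this paper. It assumes the textbook form of each algorithm, a power-of-two number of cores, and an idealized network with no contention or topology effects. The function names (`butterfly_partners`, `simulate_butterfly`, `message_counts`) are hypothetical and introduced only for illustration.

```python
def butterfly_partners(rank, n):
    """Partner of `rank` in each round of a butterfly barrier.

    Assumes n is a power of two; in round k a core exchanges a
    message with the core whose index differs in bit k.
    """
    rounds = n.bit_length() - 1  # log2(n) for a power of two
    return [rank ^ (1 << k) for k in range(rounds)]


def simulate_butterfly(n):
    """Count the rounds needed until every core has (transitively)
    heard from all n cores, i.e. until the barrier can complete."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    known = [{r} for r in range(n)]  # arrivals known to each core
    rounds = 0
    while any(len(s) < n for s in known):
        step = 1 << rounds
        # Pairwise exchange: each core merges its partner's knowledge.
        known = [known[r] | known[r ^ step] for r in range(n)]
        rounds += 1
    return rounds


def message_counts(n):
    """Idealized total message counts per barrier (an assumption:
    textbook formulas that ignore topology, routing and contention)."""
    log_n = n.bit_length() - 1
    return {
        "all-to-all": n * (n - 1),    # every core notifies every other core
        "master-slave": 2 * (n - 1),  # gather at the master, then release
        "butterfly": n * log_n,       # one message per core per round
        "tree": 2 * (n - 1),          # one message per tree edge, up and down
    }
```

Under these assumptions, `simulate_butterfly(128)` returns 7, the logarithmic number of rounds one would expect for the largest configuration considered here, while `message_counts(128)` shows how quickly the all-to-all message volume grows relative to the other schemes. Behavior on a real NoC may deviate from these idealized figures.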
Categories and Subject Descriptors: C.4 [Computer Systems Organization]: Performance of Systems
General Terms: Algorithms, Design, Performance
Keywords: Barrier, Efficiency, Manycore, Multicore, NoC, Scalability, Synchronization

Copyright 2007 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. CASES’08, October 19–24, 2008, Atlanta, Georgia, USA. Copyright 2008 ACM 978-1-60558-469-0/08/10 ...$5.00.

1. INTRODUCTION
Many-core architectures based on NoCs [3, 6] are a promising solution to tackle modern and future processor design challenges, such as power consumption, scalability and ease of design reuse. The basic idea is to use on-chip packet switches as building blocks to create a micro network as interconnect, similarly to what happens in traditional large-scale multiprocessors and distributed computing networks. NoC interconnects provide a low latency communication layer which solves physical limitations due to wire latency, providing, when compared to bus based solutions, higher bandwidth and parallelism in the communication. Many researchers have devoted great attention to finding clever topologies and optimized routing algorithms able to meet such expectations [15, 12, 14].

Current NoC based architectures are still in their infancy, and only a few companies (mainly not mainstream oriented) are starting to develop and commercialize them, in specific domains such as transaction intensive applications and network processing [17, 5]. This is mainly due to the fact that current manufacturing technologies do not allow integrating more than a dozen cores of medium complexity (i.e.
simple pipelines without Out-of-Order capabilities) while the NoC approach gains in performance and power (over traditional bus based solutions) when the number of interconnected units becomes significant. However, in the near future, as predicted by many [4, 2], as CMOS technologies continue to shrink and the number of cores integrated on a chip increases, NoC based interconnects will, hopefully, start to reach mainstream products. A recent example of this trend is the Polaris test chip in the Tera-Scale [8] research initiative proposed by Intel, where eighty simple cores are connected by a bidirectional mesh. This chip, while still unable to execute complex control structures, is capable of delivering teraflop-level performance while consuming less than 100 Watts.

One of the key points in multi-processor programming is that, when multiple processing elements are concurrently executing the same application, it is necessary to ensure correctness by adding synchronization points. Compiler-parallelized and scientific data-intensive applications use collective communication primitives to synchronize parallel execution but, as the number of cores increases, their implementation becomes a challenging task. Conventional techniques require inserting a barrier [18] at a point in the code that all processes (or a given number of them) must reach before the computation proceeds. Barriers are latency sensitive global operations which can be realized with many different techniques (centralized counters, all-to-all global exchange, etc.) and several algorithms (master-slave, butterfly, tree, etc.). Barriers require low latency communication channels to minimize overall completion time