An Approach for Supporting OpenMP on the Intel SCC

Hayder Al-Khalissi, Chair of Chip-Design for Embedded Computing, Mühlenpfordtstraße 23, 38106 Braunschweig, alkhalissi@c3e.cs.tu-bs.de
Andrea Marongiu, DEIS-University of Bologna, Viale Risorgimento 2, 40133 Bologna, amarongiu@deis.unibo.it
Mladen Berekovic, Chair of Chip-Design for Embedded Computing, Mühlenpfordtstraße 23, 38106 Braunschweig, berekovic@c3e.cs.tu-bs.de

ABSTRACT
The advent of the Single-chip Cloud Computer (SCC) in the many-core realm poses challenges to programmers. From a programmer's perspective, it is desirable to use the shared-memory paradigm, employing high-level parallel programming abstractions such as OpenMP. In this paper we discuss our ongoing efforts to support OpenMP on the SCC. Specifically, we focus on three key aspects of our approach: i) investigating an implementation that is aware of the memory hierarchy; ii) how to handle OpenMP shared variables; iii) efficiently implementing synchronization (i.e., barrier) constructs by leveraging SCC hardware support. To this end, we propose effective barrier synchronization implementations for OpenMP on the SCC. In particular, we present an evaluation of the overhead associated with the barrier algorithms required by OpenMP run-time libraries on such a machine. Our initial experimental results show significant performance improvements of up to 98% for 48 cores.

Keywords
MPSoC, OpenMP, barrier synchronization.

1. INTRODUCTION
Intel's SCC platform [10] is dedicated to exploring the future of many-core computing. It is a research architecture resembling a small cluster or "cloud" of computers, which makes it interesting for a variety of applications across the HPC space. The SCC architecture has 48 independent Pentium P54C cores, each with 16 kB data and program caches and a 256 kB L2 cache. The cores are organized as 24 dual-core tiles connected via a low-latency mesh network.
Each tile connects to a router and contains two cores, a Mesh Interface Unit (MIU), and a pair of test-and-set registers for realizing atomic accesses. Moreover, the SCC features four on-chip DDR3 memory controllers, which are also connected to the 2D mesh. Each controller supports up to 16 GB of DDR3 memory, resulting in a total system capacity of 64 GB. Being based on the P54C architecture, each core can address only 4 GB of memory. To overcome this limitation, each core has a lookup table (LUT) with 256 entries of 16 MB granularity that translates 32-bit physical core addresses into the 64 GB system memory space. The LUT is part of the configuration register space, which is itself mapped by a LUT entry and shareable between cores. Each LUT entry is configurable and points to a specific type of memory space (off-chip or on-chip memory, configuration and synchronization registers). The SCC does not offer cache coherency between the cores, but instead employs a special 16 kB Message Passing Buffer (MPB) on each tile for efficient inter-core communication. A new CL1INVMB instruction, together with a dedicated message-passing buffer memory type (MPBT), is used to provide a coherency guarantee between the caches and the MPBs. MPBT data is not cached in the L2 cache, but only in the L1 cache. Hence, when reading from the MPBs, a core needs to invalidate the corresponding L1 cache lines. As the SCC cores support only a single outstanding write request, a Write Combine Buffer (WCB) is used in MPBT mode to combine adjacent writes, up to a whole cache line, which can then be written to memory at once. When a core wants to update a data item in the MPB, it first invalidates the cached copy using the CL1INVMB instruction. Given this hardware configuration, the SCC is designed to support message-passing programming models. One well-known customized library providing the message-passing model is RCCE [30].
OpenMP [6] is a de-facto standard for shared-memory programming, since it provides very simple means to expose parallelism in a standard C (or C++, or Fortran) application, based on code annotations (compiler directives). This appealing ease of use has recently led to a number of OpenMP implementations for embedded multiprocessor systems-on-chip (MPSoCs) [18, 20, 24, 22]. MPSoCs typically feature complex memory systems, with explicitly managed SRAM banks and a NUMA organization. The SCC is no different in this respect, and poses several challenges to accommodating the OpenMP execution model. First, each core runs a separate instance of the operating system, which makes it impossible to directly run existing OpenMP implementations based on standard threading libraries (e.g., Pthreads). Second, barrier primitives should leverage fast local memories such as the MPB to minimize inter-thread synchronization time. Third, data sharing is not at all trivial, as OpenMP assumes a flat memory model, which is unmatched by the distinct private virtual memory segments seen by the different SCC cores.