An Approach for Supporting OpenMP on the Intel SCC

Hayder Al-Khalissi, Chair of Chip-Design for Embedded Computing, Mühlenpfordtstraße 23, 38106 Braunschweig, alkhalissi@c3e.cs.tu-bs.de
Andrea Marongiu, DEIS-University of Bologna, Viale Risorgimento 2, 40133 Bologna, amarongiu@deis.unibo.it
Mladen Berekovic, Chair of Chip-Design for Embedded Computing, Mühlenpfordtstraße 23, 38106 Braunschweig, berekovic@c3e.cs.tu-bs.de

ABSTRACT
The advent of the Single-chip Cloud Computer (SCC) in the many-core realm poses challenges to programmers. From a programmer's perspective, it is desirable to use the shared-memory paradigm, employing high-level parallel programming abstractions such as OpenMP. In this paper we discuss our ongoing efforts to support OpenMP on the SCC. Specifically, we focus on three key aspects of our approach: i) investigating an implementation that is aware of the memory hierarchy; ii) how to handle OpenMP shared variables; iii) efficiently implementing synchronization (i.e., barrier) constructs by leveraging SCC hardware support. To this end, we propose effective barrier synchronization implementations for OpenMP on the SCC. In particular, we present an evaluation of the overhead associated with the barrier algorithms required by OpenMP run-time libraries on such a machine. Our initial experimental results show significant performance improvements of up to 98% for 48 cores.

Keywords
MPSoC, OpenMP, barrier synchronization.

1. INTRODUCTION
Intel's SCC platform [10] is dedicated to exploring the future of many-core computing. It is a research architecture resembling a small cluster or "cloud" of computers, which makes it interesting for a variety of applications across the HPC space. The SCC architecture has 48 independent Pentium P54C cores, each with 16 kB data and program caches and a 256 kB L2 cache. The cores are organized as 24 dual-core tiles connected via a low-latency mesh network.
Each tile connects to a router and contains two cores, a Mesh Interface Unit (MIU), and a pair of test-and-set registers for realizing atomic accesses. Moreover, the SCC features four on-chip DDR3 memory controllers, which are also connected to the 2D mesh. Each controller supports up to 16 GB of DDR3 memory, resulting in a total system capacity of 64 GB. Being based on the P54C architecture, each core can address only 4 GB of memory. To overcome this limitation, each core has a lookup table (LUT) with 256 entries of 16 MB granularity that translates 32-bit physical core addresses into the 64 GB system memory space. The LUT is part of the configuration register space, which is itself mapped by a LUT entry and shareable between cores. Each LUT entry is configurable and points to a specific type of memory space (off-chip or on-chip memory, configuration and synchronization registers). The SCC does not offer cache coherency between the cores, but instead employs a special 16 kB Message Passing Buffer (MPB) on each tile for efficient inter-core communication. A new CL1INVMB instruction, together with a dedicated message-passing buffer memory type (MPBT), is used to provide a coherency guarantee between the caches and the MPBs. MPBT data is not cached in the L2 cache, but only in the L1 cache. Hence, when reading from the MPBs, a core needs to invalidate the corresponding L1 cache lines. As the SCC cores support only a single outstanding write request, a Write Combine Buffer (WCB) is used in MPBT mode to combine adjacent writes, up to a whole cache line, which can then be written to memory at once. When a core wants to update a data item in the MPB, it first invalidates the cached copy using the CL1INVMB instruction. Given this hardware configuration, the SCC is designed to support message-passing programming models. One well-known customized library providing the message-passing model is RCCE [30].
OpenMP [6] is a de-facto standard for shared-memory programming, since it provides very simple means to expose parallelism in a standard C (or C++, or Fortran) application, based on code annotations (compiler directives). This appealing ease of use has recently led to a number of OpenMP implementations for embedded multiprocessor systems-on-chip (MPSoCs) [18, 20, 24, 22]. MPSoCs typically feature complex memory systems, with explicitly managed SRAM banks and a NUMA organization. The SCC is no different in this respect, and poses several challenges to accommodating the OpenMP execution model. First, each core runs a separate instance of the operating system, which makes it impossible to directly run existing OpenMP implementations based on standard threading libraries (e.g., Pthreads). Second, barrier primitives should leverage fast local memories such as the MPB to minimize inter-thread synchronization time. Third, data sharing is not at all trivial, as OpenMP assumes a flat memory model, which is unmatched by the distinct private virtual memory segments seen by the different SCC cores.