Vector Class on Limited Local Memory (LLM) Multi-core Processors *

Ke Bai, Di Lu, Aviral Shrivastava
Compiler Microarchitecture Lab, Arizona State University, Tempe, AZ 85281, USA
{Ke.Bai, dilu3, Aviral.Shrivastava}@asu.edu

ABSTRACT
The Limited Local Memory (LLM) multi-core architecture is a promising solution for a scalable memory hierarchy. An LLM architecture, e.g., the IBM Cell/B.E., is a purely distributed memory architecture in which each core can directly access only its small local memory, which is why it is extremely power-efficient. Vector is a popular container class in the C++ Standard Template Library (STL) that provides functionality similar to a dynamic array. Because of the small, non-virtualized memory in the LLM architecture, the vector library implementation cannot be used as-is. In this paper, we propose and implement a scheme to manage the vector class in the local memory present in each core of an LLM multi-core architecture. Our scalable solution transparently maintains vector data between the shared global memory and the local memories. In addition, our vector class provides different data transfer granularities to achieve better performance. We also propose a mechanism to ensure the validity of pointers-to-elements when vector elements are moved into the global memory. Experimental results show that our vector class improves programmability significantly, while the overhead is contained within 7%.

Categories and Subject Descriptors
D.3.m [Software]: Miscellaneous; D.1.5 [Software]: Object-oriented Programming

General Terms
Algorithms, Design, Experimentation, Performance.

Keywords
Vector, local memory, scratch pad memory, embedded system, multi-core processor, IBM Cell, PS3, MPI

* This research was partially funded by grants from National Science Foundation CCF-0916652, IIP-0856090, NSF I/UCRC for Embedded Systems, Microsoft Research, SFAz, Raytheon and Stardust Foundation.
CASES'11, October 9–14, 2011, Taipei, Taiwan.
Copyright 2011 ACM 978-1-4503-0713-0/11/10 ...$10.00.

1. INTRODUCTION
As we transition from single core to many cores, maintaining the illusion of a single unified memory in a multi-core architecture becomes challenging, for two main reasons. First, cache coherency protocols do not scale well to hundreds of cores [7]. Second, even where coherence is possible, the overhead of automatically managing memory, as caches do, is becoming prohibitive in terms of power consumption. Even in single-core processors, caches can consume more than half of the processor power [7], and they are expected to consume a much larger fraction in many-core systems.

The Limited Local Memory (LLM) architecture is a scalable memory architecture in which each core has a small local memory, and each core can access only its own limited local memory. As shown in Figure 1, the IBM Cell B.E. is a popular example of the LLM architecture. It contains one main core, the Power Processing Element (PPE), and eight Synergistic Processing Elements (SPEs). Each SPE has 256 KB of local memory [12]. If all the code and data of the task mapped to an SPE fit in its local memory, execution is very power-efficient. In fact, the peak power-efficiency of the IBM Cell processor is 5.1 Giga operations per second per watt [17]. Contrast this with the power-efficiency of traditional shared-memory multi-cores: the Intel Core2 Quad achieves only 0.35 Giga operations per second per watt [17].
However, if the code and data of the application do not fit into the local memory, the global memory must be used to hold them, with data transferred through explicit DMA calls. This explicit data management is a challenge in LLM architectures.

The Standard Template Library (STL) is a popular generic programming tool included in the C++ standard library. It provides a set of container classes: data structures whose instances are collections of other objects. Vector is a container class that holds data as a dynamic array. Because it is dynamic, a vector uses a variable amount of memory, proportional to the amount of data it contains. Unfortunately, using the STL in LLM architectures is difficult, because the STL library is not aware of the size of the local memory. On a Cell SPE, the vector class works fine with a small amount of data; however, when more data is pushed into the vector, the program aborts with the error "terminate called after throwing an instance of 'std::bad_alloc'". This happens when the STL wants to allocate space for more data but there is no more space in the local memory. To support the vector class in an LLM architecture, the vector data must be managed between the local and global memory.

Many works in parallel and multi-thread programming have investigated supporting parallel STL for homogeneous