CMSM: An Efficient and Effective Code Management for Software Managed Multicores

Ke Bai, Jing Lu, Aviral Shrivastava and Bryce Holton
Compiler Microarchitecture Laboratory
Arizona State University, Tempe, Arizona 85287, USA
Email: {Ke.Bai, Jing.Lu, Aviral.Shrivastava, Bryce.Holton}@asu.edu

Abstract—As we scale the number of cores in a multicore processor, scaling the memory hierarchy is a major challenge. Software Managed Multicore (SMM) architectures are one of the promising solutions. In an SMM architecture, there are no caches, and each core has only a local scratchpad memory. If the code and data of the task mapped to a core do not all fit in its local scratchpad memory, then explicit code and data management is required. In this paper, we solve the problem of efficiently managing code on an SMM architecture. We extend the state of the art by: i) correctly calculating the code management overhead, even in the presence of branches in the task, and ii) developing a heuristic, CMSM (Code Mapping for Software Managed multicores), that results in efficient code execution on the local scratchpad memory. Our experimental results, collected after executing applications from the MiBench suite [1] on the Cell SPEs (Cell is an SMM architecture) [2], demonstrate that correct management cost calculation and branch consideration can improve performance by 12%. Our heuristic CMSM can reduce runtime in more than 80% of the cases, and by up to 20% on our set of benchmarks.

Keywords—Code, instruction, local memory, scratchpad memory, SPM, embedded systems, multi-core processor

I. INTRODUCTION

We are in a transition from multicore processors to many-core processors. While scaling the number of cores is relatively straightforward, scaling the memory hierarchy is a major challenge.
Most experts believe that fully cache-coherent architectures will not scale to hundreds or thousands of cores, and therefore architects are looking for alternative, scalable architecture designs. Recently, Intel manufactured a 48-core architecture with non-coherent caches, the Single-chip Cloud Computer (SCC) [3]. The latest 6-core DSP from Texas Instruments, the TI 6472 [4], also features non-coherent caches. But caches still consume a large portion of power and die area [5]. A promising option for an even more power-efficient and scalable memory hierarchy is to have no caches, but only scratchpad memories. Scratchpad memories are raw memories that do not have any tags or lookup logic. As a result, they consume approximately 30% less area and power than a direct-mapped cache of the same effective capacity [5]. Therefore, such scratchpad-based multicore architectures have the potential to be more power-efficient and scalable than traditional cache-based architectures.

However, this improvement in power efficiency comes at the cost of programmability. Since these scratchpad-based multicore architectures implement no data management in hardware, data must be managed by the application in software. This means that the data the application will require must be brought into the local scratchpad memory using a Direct Memory Access (DMA) command before it is used, and can be evicted back to the main memory (also using a DMA command) after it is used. Due to this explicit need for data management in software, these processor designs are termed Software Managed Multicore (SMM) architectures. A good example of an SMM architecture is the Cell processor incorporated in the Sony PlayStation 3. The Synergistic Processing Elements (SPEs) in the Cell processor have only scratchpad memories.

(This research was funded by a grant from the National Science Foundation, CCF-0916652.)
The Cell's power efficiency is around 5 GFlops per watt [2].

An SMM architecture is truly a "distributed memory architecture on a chip." Applications for it are written in the form of interacting tasks, and the tasks are mapped to the cores of the SMM architecture. Each core can access only its local scratchpad memory; to access other local memories or the main memory, explicit DMA instructions are required in the program. The local memory is shared among the code, stack data, global data and heap data of the task executing on the core.

How to manage task data on the scratchpad memories of the cores is an important problem that has drawn significant attention in recent years [6]–[14]. While management is needed for all code and data of the task when they cannot fit in the local memory, in this paper we focus on the problem of code management, since efficient code management can be of considerable significance to the performance of the system.

The first step in code management is to assign some space in the local scratchpad memory for managing code. This space is then divided into regions, and the functions in the program are mapped to these regions. Functions mapped to the same region are compiled and linked starting at the same start address (that of the region). At runtime, only one of the functions mapped to a region can be present in the region at a time. At each function call, it is checked whether the function being called is present in its region or not; if not, it is fetched from the main memory using a DMA command [15]. Therefore, the size of a region is equal to the size of the largest function mapped to it, and the total code space required is the sum of the sizes of the regions. Given some space on the local memory, the code management problem is to i) divide the code space into regions, and ii) find a mapping of functions to regions, so that the management overhead is minimized.
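The space computation described above can be sketched in a few lines. The following is an illustrative Python model (the paper itself gives no code); the function names, sizes, and the function-to-region mapping are hypothetical, chosen only to show that each region must be as large as its largest function and that total code space is the sum of the region sizes.

```python
# Illustrative sketch of region sizing (hypothetical sizes and mapping).
# mapping: function name -> region id; func_sizes: function name -> bytes.

def region_sizes(func_sizes, mapping):
    """Each region must hold the largest function mapped to it."""
    sizes = {}
    for func, region in mapping.items():
        sizes[region] = max(sizes.get(region, 0), func_sizes[func])
    return sizes

def total_code_space(func_sizes, mapping):
    """Total code space required = sum of the region sizes."""
    return sum(region_sizes(func_sizes, mapping).values())

func_sizes = {"main": 4096, "fft": 8192, "bitrev": 1024, "window": 2048}
mapping = {"main": 0, "window": 0, "fft": 1, "bitrev": 1}

print(region_sizes(func_sizes, mapping))      # {0: 4096, 1: 8192}
print(total_code_space(func_sizes, mapping))  # 12288
```

Note how mapping "fft" and "bitrev" to the same region saves space (12288 bytes instead of 15360 for one region per function), at the price of possible DMA fetches whenever the two displace each other at runtime — exactly the space/overhead trade-off the code mapping problem must balance.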
We estimate the management overhead to be proportional to the size of the code that needs to be transferred between the local memory and the main memory.

978-1-4799-1417-3/13/$31.00 ©2013 IEEE
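Under this cost model, the overhead of a given mapping can be estimated by replaying a call trace and summing the bytes DMA'd in. The sketch below is a simplified illustration with hypothetical functions and sizes: a function is fetched only when it is not already resident in its region, and (for brevity) the model ignores effects such as re-fetching a caller whose region was overwritten during the call.

```python
# Simplified cost model: overhead ~ total bytes transferred from main
# memory. A call misses only if its region currently holds another function.

def management_cost(call_trace, func_sizes, mapping):
    resident = {}       # region id -> function currently loaded there
    transferred = 0
    for func in call_trace:
        region = mapping[func]
        if resident.get(region) != func:   # miss: DMA the function in
            transferred += func_sizes[func]
            resident[region] = func
    return transferred

func_sizes = {"main": 4096, "fft": 8192, "bitrev": 1024}
mapping = {"main": 0, "fft": 1, "bitrev": 1}   # fft, bitrev share region 1
trace = ["main", "fft", "bitrev", "fft", "bitrev"]

# fft and bitrev evict each other on every call in this trace:
# 4096 + 8192 + 1024 + 8192 + 1024 = 22528 bytes transferred.
print(management_cost(trace, func_sizes, mapping))
```

With this model, a mapping that places frequently alternating callers in separate regions transfers fewer bytes, which is precisely the objective the region-partitioning and function-mapping steps try to minimize.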