ENERGY-PERFORMANCE TRADEOFFS FOR THE SHARED MEMORY IN MULTI-PROCESSOR SYSTEMS-ON-CHIP

K. Patel, E. Macii
Politecnico di Torino
Dipartimento di Automatica e Informatica
Torino, ITALY

M. Poncino
Università di Verona
Dipartimento di Informatica
Verona, ITALY

ABSTRACT

Accesses to the shared memory in multi-processor systems-on-chip represent a significant performance bottleneck and source of energy consumption, because of the synchronization of the accesses as well as of their relative "distance" from the processors. While multi-port memories are usually employed to solve the performance issues, no standard solution exists for the energy issues. In this work, we propose an architecture for the shared memory that is based on application-specific partitioning, which offers various energy-performance tradeoffs. The shared address space of the application is split into two non-overlapping subsets, which are mapped onto two distinct memory blocks. By properly tuning the point at which the address space is split, solutions with different energy/performance tradeoffs can be achieved. Experiments on a set of parallel benchmarks show that it is possible to achieve energy-delay product savings ranging from 40% to 62% (50% on average) with respect to a conventional, non-partitioned architecture.

1. INTRODUCTION

Modern SoC platforms normally host several processors that execute applications exhibiting highly complex communication patterns. In shared-memory multiprocessor platforms, this communication activity physically occurs through one or more shared memories. For applications in which parallelism is built around shared data (e.g., by partitioning data structures over the processors), accesses to the shared memories may take a significant fraction of the execution cycles, as well as of the total energy consumption.
In this scenario, implementing the shared memory as a single, monolithic memory block is not the best choice from either the performance or the energy point of view.

Concerning performance, the inefficiency comes from the fact that conventional memories cannot exploit memory accesses that could occur in parallel. A well-accepted solution to this problem is the use of multi-port memories, which increase bandwidth by construction: a P-port memory allows up to P accesses in parallel (i.e., in a single cycle).

Concerning energy consumption, however, there is no obvious solution to tackle the energy increase. In fact, the use of multi-port memories, which would improve performance, makes the energy issues even more problematic, since the bandwidth increase comes at the price of a significant increase in energy consumption. Designers can thus resort to energy-efficient memory architectures [1, 2, 3], which, however, have been specifically devised for uniprocessor systems.

The objective of this work is to propose a novel shared-memory architecture template that combines the bandwidth advantages of the multi-port approach with the energy and access-time advantages of partitioned, single-port memories. The proposed solution is based on the idea of application-specific memory partitioning [4], and it consists of using single-port blocks that can be accessed in parallel (thus mimicking a multi-port memory), yet with an individual energy cost that is smaller than that of the monolithic memory. We introduce analytical expressions for performance and energy consumption that allow us to explore the energy-performance tradeoff. We present experimental results showing that the new architecture achieves energy-delay product savings ranging from 40% to 62% (50% on average) with respect to a conventional, non-partitioned architecture.
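The tradeoff between a multi-port monolithic memory and parallel single-port banks can be illustrated with a simple first-order model. The following sketch is only illustrative: the per-port energy penalty, the size-dependent energy scaling, and the access counts are assumptions for demonstration, not the analytical expressions or measured values of this work.

```python
# First-order model of the energy-performance tradeoff between a
# monolithic multi-port shared memory and two single-port banks.
# All constants below are illustrative assumptions, not measured data.

def monolithic_cost(n_accesses, ports=2, base_energy=1.0, base_delay=1.0):
    """Multi-port memory: up to `ports` accesses per cycle, but each
    access is more expensive (multi-porting increases cell area and
    bitline capacitance; a linear penalty is assumed here)."""
    energy = n_accesses * base_energy * ports   # assumed per-port penalty
    cycles = -(-n_accesses // ports) * base_delay  # ceiling division
    return energy, cycles

def partitioned_cost(accesses_a, accesses_b, base_energy=1.0,
                     base_delay=1.0, small_fraction=0.5):
    """Two single-port banks accessed in parallel.  A smaller bank has
    a lower per-access energy (assumed proportional to its size)."""
    e_small = base_energy * (0.5 + 0.5 * small_fraction)  # assumed scaling
    energy = accesses_a * e_small + accesses_b * base_energy
    cycles = max(accesses_a, accesses_b) * base_delay  # banks in parallel
    return energy, cycles

# Compare the energy-delay product (EDP) for a hypothetical trace of
# 1000 shared accesses, as the split point moves traffic between banks.
total = 1000
e_mono, t_mono = monolithic_cost(total)
edp_mono = e_mono * t_mono
for frac in (0.3, 0.5, 0.7):
    a = int(total * frac)
    e, t = partitioned_cost(a, total - a, small_fraction=frac)
    print(f"split {frac:.1f}: EDP saving = {1 - (e * t) / edp_mono:.0%}")
```

Even this crude model reproduces the qualitative behavior described above: the partitioned organization trades a longer worst-case serialization (cycles follow the more loaded bank) for cheaper individual accesses, and the saving peaks when the split balances the traffic.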
2. PARTITIONED SHARED MEMORY ARCHITECTURE

The target architecture of this work is that of multi-processor SoCs, where each processor core has a cache and a private memory (PM), containing private data and code, which is accessed through a local bus. The processors are also connected, through a common global bus, to another memory containing the data shared between the various threads executing on the processors. In this work we do not consider other types of interconnection, such as point-to-point or distributed (i.e., network-like) ones.

Accesses to the shared memory are significantly slower than accesses to the private ones, because (i) access to the shared global bus requires some form of arbitration among the processors, and (ii) potentially simultaneous accesses to the shared memory need to be serialized. The latter aspect, in particular, requires extra execution cycles, which also imply an increase in energy. The shared memory thus represents a source of performance penalty and, as a consequence, of extra energy consumption.

In this work, we propose an architecture for the shared memory that improves both performance (that is, the number of execution cycles) and energy consumption with respect to a conventional organization. In our scheme, the memory address space is mapped over distinct memory banks that can be simultaneously accessed by the different processors. Each bank covers a non-overlapping subset of the address space, with no replication of memory words. Performance improvement is achieved by accessing the various banks in parallel, thus mimicking the behavior of multi-port memories.
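The bank-selection logic implied by this scheme can be sketched as follows. The sizes, split point, and access sets below are hypothetical; the sketch only shows how a non-overlapping split routes each address to exactly one bank, and how same-cycle accesses to the same bank are serialized while accesses to different banks proceed in parallel.

```python
# Sketch of address-to-bank mapping for a two-way partitioned shared
# memory: addresses below a tunable split point go to bank 0, the rest
# to bank 1.  All sizes are illustrative assumptions.

SHARED_SIZE = 4096   # words in the shared address space (assumed)
SPLIT_POINT = 1024   # tunable partition boundary (assumed)

def bank_of(addr):
    """Map a shared-memory address to its bank.  The two subsets are
    non-overlapping and unreplicated, so each word lives in exactly
    one bank."""
    assert 0 <= addr < SHARED_SIZE
    return 0 if addr < SPLIT_POINT else 1

def cycles_for(accesses):
    """Cycles needed to serve one set of simultaneous accesses (one
    per processor): each single-port bank serves one access per cycle,
    so the cost is the length of the longest per-bank queue."""
    per_bank = [0, 0]
    for addr in accesses:
        per_bank[bank_of(addr)] += 1
    return max(per_bank)

# Two processors hitting different banks complete in a single cycle...
print(cycles_for([100, 2000]))   # -> 1
# ...while two hits to the same bank must be serialized.
print(cycles_for([100, 200]))    # -> 2
```

Tuning SPLIT_POINT changes both which accesses can overlap (performance) and how much traffic falls on the smaller, cheaper bank (energy), which is exactly the knob the proposed architecture exposes.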