ESP-NUCA: A Low-cost Adaptive Non-Uniform Cache Architecture

Javier Merino, Valentin Puente and Jose A. Gregorio
Computer Architecture Group
University of Cantabria, Santander, Spain
merinocj@unican.es, vpuente@unican.es, monaster@unican.es

Abstract

This paper introduces a cost-effective cache architecture called Enhanced Shared-Private Non-Uniform Cache Architecture (ESP-NUCA), which is suitable for high-performance Chip MultiProcessors (CMPs). This architecture enhances system stability by combining the advantages of private and shared caches. Starting from a shared NUCA, ESP-NUCA introduces a low-cost mechanism to dynamically allocate private cache blocks closer to their owner processor. In this way, average on-chip access latency is reduced and inter-core interference is minimized. ESP-NUCA synergistically integrates victims and replicas, making it possible to exploit multiple readers for shared data and to maximize cache usage under unbalanced core utilization. This architecture leads to stable behavior across a broad spectrum of working scenarios. ESP-NUCA not only outperforms architectures with similar implementation costs, such as private and shared caches, by up to 20% and 40% respectively, but also outperforms much costlier architectures such as D-NUCA [13] by up to 28%, Adaptive Selective Replication [3] by up to 19%, and Cooperative Caching [5] by up to 15%. Moreover, performance variance across the set of benchmarks is 37% lower than with ASR, 87% lower than with D-NUCA, and 43% lower than with Cooperative Caching.

1. Introduction

The future growth in the number of cores per chip in CMP architectures could be jeopardized by the available off-chip bandwidth. To mitigate this effect, a large amount of on-chip cache should be provided. Although the transistor budget is generous, a multi-megabyte cache hierarchy in a many-core CMP represents a challenge.
First of all, it is necessary to define the sharing criteria among the cores of the CMP for the on-chip portion of the memory hierarchy. The most suitable sharing policy is strongly dependent on the workload. On the one hand, some applications are characterized by a significant degree of sharing, whereas others exhibit little to no sharing at all. On the other hand, simultaneous threads can interfere destructively in the memory hierarchy. The usage scenarios of high-performance CMPs are very dissimilar, ranging from number-crunching applications to information-processing suites or desktop applications. In order to provide a truly general-purpose system, the on-chip memory hierarchy should be smart enough to adapt its behavior to very different working conditions.

The Last Level Cache (LLC) may be structured as private or shared. Although this discussion applies equally to hierarchies with three on-chip cache levels, for simplicity we assume in the rest of the paper that only two levels are present; the LLC is therefore equivalent to the L2. From an architectural point of view, private and shared caches exhibit different properties. Private caches are characterized by lower on-chip access latency, as they enable the placement of cache blocks closer to the owner processor. Moreover, they provide inter-thread isolation, eliminating most unnecessary inter-core interference. Shared caches are distinguished by lower off-chip miss rates than private caches because shared data is not replicated throughout different L2 locations. They can also outperform private caches when threads with unbalanced memory usage run in different cores of the chip, or when only a reduced number of cores run active threads. Consequently, depending on the inherent characteristics of the system workload, either architectural design can outperform the other [10, 22].
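The latency side of this trade-off can be illustrated with a toy model (ours, not taken from the paper): on a tiled CMP, a shared NUCA interleaves blocks uniformly across all banks, so the expected hit latency includes the average network distance to every bank, whereas a private organization serves hits from the local bank. The mesh size, bank latency, and per-hop latency below are hypothetical values chosen only for illustration.

```python
# Toy model contrasting average L2 hit latency in a private vs. a shared
# (statically interleaved) last-level cache on a 4x4 tiled CMP.
# All latency figures are illustrative assumptions, not measured values.

def hop_distance(a, b, mesh_width=4):
    """Manhattan distance between two tiles on a mesh_width-wide 2D mesh."""
    ax, ay = a % mesh_width, a // mesh_width
    bx, by = b % mesh_width, b // mesh_width
    return abs(ax - bx) + abs(ay - by)

def avg_latency_shared(core, n_banks=16, bank_latency=6, hop_latency=2):
    """Shared NUCA: blocks interleave uniformly over all banks, so the
    expected hit latency averages the round-trip-free distance to each bank."""
    total = sum(bank_latency + hop_latency * hop_distance(core, b)
                for b in range(n_banks))
    return total / n_banks

def avg_latency_private(core, bank_latency=6):
    """Private cache: every block a core owns lives in its local bank."""
    return bank_latency

print(avg_latency_private(0))   # 6    : local bank access only
print(avg_latency_shared(0))    # 12.0 : average over all 16 banks
```

Under these assumptions a corner core pays roughly twice the hit latency in the shared organization, which is exactly the gap that NUCA block-migration and replication schemes (and ESP-NUCA's private-block allocation) try to close without giving up the shared cache's lower miss rate.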
Notwithstanding, the on-chip memory hierarchy of general-purpose CMPs should be flexible enough to adapt its behavior to the requirements of the running workload, maximizing hit rates, minimizing on-chip access latency, and reducing unnecessary inter-core conflicts in order to achieve stable performance over a large range of scenarios.

A plethora of studies have made proposals to deal with the previously identified issues. Some propose starting from a private cache and limiting the performance degradation produced by block replication [3, 5, 6, 22]. Others, using a shared scheme as the baseline, attempt to minimize on-chip