ABSTRACT

Chip Multiprocessors (CMPs) allow different applications to execute concurrently on a single chip. When applications with differing demands for memory compete for a shared cache, the conventional LRU replacement policy can significantly degrade cache performance when the aggregate working set size exceeds the shared cache capacity. In such cases, shared cache performance can be significantly improved by preserving the entire working sets of the applications that can co-exist in the cache and preserving some portion of the working sets of the remaining applications.

This paper investigates the use of adaptive insertion policies to manage shared caches. We show that directly extending the recently proposed Dynamic Insertion Policy (DIP) is inadequate for shared caches because DIP is unaware of the characteristics of individual applications. We propose the Thread-Aware Dynamic Insertion Policy (TADIP), which takes into account the memory requirements of each of the concurrently executing applications. Our evaluation with multi-programmed workloads for 2-core, 4-core, 8-core, and 16-core CMPs shows that a TADIP-managed shared cache improves overall throughput by as much as 94%, 64%, 26%, and 16% respectively (on average 14%, 18%, 15%, and 17%) over the baseline LRU policy. The performance benefit of TADIP is 2.6x that of DIP and 1.3x that of the recently proposed Utility-based Cache Partitioning (UCP) scheme. We also show that a TADIP-managed shared cache provides performance benefits similar to doubling the size of an LRU-managed cache. Furthermore, TADIP requires a total storage overhead of less than two bytes per core, does not require changes to the existing cache structure, and performs similarly to LRU for LRU-friendly workloads.

Categories and Subject Descriptors
B.3.2 [Design Styles]: Cache memories; C.1.4 [Parallel Architectures]

General Terms
Design, Performance

Keywords
Shared Cache, Cache Partitioning, Set Dueling, Replacement

1. INTRODUCTION

High-performance processors typically contain multiple cores on a single chip, which allows them to execute multiple applications (or threads) concurrently. As multi-core processors become pervasive, a key design issue facing processor architects is how to organize and manage the on-chip last-level cache (LLC). Since shared caches enable more flexible and dynamic allocation of cache resources, recent processors such as Intel's Core Duo [1], IBM's Power5 [6], and Sun's Niagara [8] have opted for a shared LLC. As the number of cores on a chip increases, so does the contention among applications sharing the LLC. The performance of such systems is therefore heavily influenced by how efficiently the shared cache is managed.

The commonly used LRU replacement policy implicitly allocates cache resources to competing applications based on their rate of demand. As a result, it often allocates cache resources to applications that do not benefit from the cache [18][13]. Shared cache performance can be significantly improved by a cache management scheme that allocates cache resources based on benefit rather than rate of demand.

This study focuses on the dynamic management of a shared cache among competing applications. We seek four properties from such a management scheme: high performance, robustness, scalability, and low design overhead. The scheme should deliver high performance for a given performance metric. Since future processors are expected to have a large number of cores, the variety of competing applications in a workload mix is expected to be high, so the proposed cache management policy must not significantly degrade the performance of workload mixes for which the baseline LRU policy already works well. Furthermore, since the number of concurrently executing applications is expected to grow with the number of cores, the proposed mechanism must scale to a large number of cores.
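To make the demand-rate behavior of LRU concrete, here is a minimal Python sketch (our own illustration, not from the paper) of a single set in a shared LRU-managed cache. For simplicity it models every reference as a miss that inserts at the MRU position: a thread with no reuse that merely misses twice as often still ends up owning most of the ways.

```python
from collections import deque

WAYS = 8  # associativity of the modeled cache set

def simulate(pattern, iters=100):
    """Simulate one set of a shared LRU cache. Every reference is treated
    as a miss that evicts the LRU line and inserts at MRU, so allocation
    tracks each thread's miss (demand) rate. Returns ways held per thread."""
    stack = deque(maxlen=WAYS)      # left end = MRU, right end = LRU
    for _ in range(iters):
        for tid in pattern:
            stack.appendleft(tid)   # insert at MRU; LRU line falls off the right
    return {t: list(stack).count(t) for t in sorted(set(pattern))}

# Thread 0 reuses a small working set; thread 1 streams with no reuse
# but misses twice as often. LRU still hands thread 1 most of the ways.
print(simulate([0, 1, 1]))   # -> {0: 2, 1: 6}
```

Allocation here is purely proportional to miss rate, which is exactly the pathology the paper targets: the streaming thread derives no benefit from its six ways, while the reuse-friendly thread is squeezed.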
Finally, from an implementation point of view, the mechanism must have low overhead and avoid extra storage structures so that area, power, verification, testing, and design overheads are minimized. This paper seeks to design such a high-performance, robust, scalable, and negligible-hardware-overhead mechanism for managing shared caches.

A recent study [12] showed that dynamically changing the insertion policy can provide high-performance cache management for private caches at negligible hardware and design overhead. The proposed Dynamic Insertion Policy (DIP) [12] consists of two component policies: the Bimodal Insertion Policy (BIP) and the traditional LRU policy. BIP is a thrashing-resistant policy that inserts the majority of incoming lines at the LRU position.

§ Moinuddin Qureshi contributed to this work prior to joining IBM Research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PACT'08, October 25–29, 2008, Toronto, Ontario, Canada.
Copyright 2008 ACM 978-1-60558-282-5/08/10...$5.00.

Adaptive Insertion Policies for Managing Shared Caches
Aamer Jaleel†, William Hasenplaugh†, Moinuddin Qureshi§, Julien Sebot‡, Simon Steely Jr.†, Joel Emer†
† Intel Corporation, VSSAD, Hudson, MA: {aamer.jaleel, william.c.hasenplaugh, simon.c.steely.jr, joel.emer}@intel.com
§ IBM T. J. Watson Research Center, Yorktown Heights, NY: mkquresh@us.ibm.com
‡ Intel Israel Design Center, Haifa, Israel: julien.sebot@intel.com
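Reading DIP's two component policies together with the per-thread selection and sub-two-bytes-per-core overhead mentioned in the abstract, the mechanism can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the class and function names are ours, the 10-bit counter width is an assumption consistent with the stated storage budget, and the 1/32 bimodal throttle is the value used in the original DIP proposal.

```python
import random

WAYS = 8
BIP_EPSILON = 1 / 32   # bimodal throttle from the DIP proposal (assumed here)

class ThreadPolicySelector:
    """Sketch of per-thread set dueling: each thread owns one saturating
    PSEL counter, updated by misses in its dedicated leader sets; the
    counter's high half selects LRU or BIP insertion for that thread's
    follower sets. One ~10-bit counter per core fits the paper's
    under-two-bytes-per-core storage claim (our assumption)."""
    def __init__(self, bits=10):
        self.max = (1 << bits) - 1
        self.psel = self.max // 2        # start undecided

    def miss_in_lru_leader(self):        # LRU hurting this thread -> lean BIP
        self.psel = min(self.psel + 1, self.max)

    def miss_in_bip_leader(self):        # BIP hurting this thread -> lean LRU
        self.psel = max(self.psel - 1, 0)

    def policy(self):
        return "BIP" if self.psel > self.max // 2 else "LRU"

def insert_on_miss(stack, line, policy, rng=random.random):
    """Fill one set (index 0 = MRU end) after a miss, per the chosen policy."""
    if len(stack) == WAYS:
        stack.pop()                      # evict the LRU victim
    if policy == "LRU" or rng() < BIP_EPSILON:
        stack.insert(0, line)            # conventional insertion at MRU
    else:
        stack.append(line)               # BIP: insert at LRU; a line earns MRU
                                         # only by being reused before eviction
```

Under this sketch, a thrashing thread's misses in its LRU leader sets drive its counter up until its follower sets switch to BIP insertion, while a cache-friendly thread stays on LRU insertion, which is the thread-awareness that a single global DIP counter cannot provide.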