Modeling Cache Sharing on Chip Multiprocessor Architectures

Pavlos Petoumenos,1 Georgios Keramidas,1 Håkan Zeffer,2 Stefanos Kaxiras,1 Erik Hagersten2

1 Department of Electrical and Computer Engineering, University of Patras, Greece
{ppetoumenos, keramidas, kaxiras}@ee.upatras.gr
2 Department of Information Technology, Uppsala University, Sweden
{zeffer, eh}@it.uu.se

Abstract — As CMPs emerge as the dominant architecture for a wide range of platforms (from embedded systems and game consoles, to PCs, to servers), managing on-chip resources such as shared caches becomes a necessity. In this paper we propose a new statistical model of a CMP shared cache which not only describes cache sharing but also its management via a novel fine-grain mechanism. Our model, called StatShare, accurately describes the behavior of the sharing threads using run-time information (reuse-distance information for memory accesses) and helps us understand how effectively each thread uses its space. The mechanism to manage the cache at cache-line granularity is inspired by Cache Decay, but contains important differences. Decayed cache lines are not turned off to save leakage but are rather "available for replacement." Decay modifies the underlying replacement policy (random, LRU) to control sharing in a flexible, non-strict way, which makes it superior to strict cache-partitioning schemes (both fine- and coarse-grained). The statistical model allows us to assess a thread's cache behavior under decay. Detailed CMP simulations show that: i) StatShare accurately predicts thread behavior in a shared cache; ii) managing sharing via decay (in combination with StatShare's run-time information) can be used to enforce external QoS requirements or various high-level fairness policies.

1. Introduction

Processor designers are fast moving towards multiple cores on a chip to achieve new levels of performance.
Most newly released CPUs are chip multiprocessors, and all processor vendors offer at least one CPU model of this design. The goal is to hide long memory latencies as much as possible and to maximize performance from multiple threads under strict power budgets. CMPs are becoming the dominant architecture for many server-class machines [8,9]. For reasons of efficiency and economy, sharing of some chip resources is a necessity. Shared resources in CMPs typically include Level 2 caches, and this creates a need for skilful management policies, since L2 caches are a critical element in the performance of all modern computers.

It is essential for the future management of cache resources, as well as for thread migration strategies, to fully understand how threads sharing a common cache interact with each other. To model and understand cache sharing we have built a new theoretical framework that accurately and concisely describes the application interplay in shared caches. Our cache model, named StatShare, is derived from the StatCache statistical cache model [6], which yields the miss ratio of an application for any cache size from a single set of reuse-distance measurements. While the StatCache model uses the number of memory references as its unit of "time," StatShare uses the number of cache replacements at the studied cache level (Cache Allocation Ticks, CAT [4]) as its unit of time. This allows for a natural mapping of the cache statistics to the shared cache level. It further leads to a very efficient implementation of StatShare, which enables online analysis that can feed, for example, a dynamic resource scheduler — in contrast, StatCache is an off-line model. This paper shows, with detailed CMP simulation and co-scheduled applications, that StatShare accurately predicts both miss ratios and cache footprints online. We also demonstrate how StatShare can be used to manage a shared cache.
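To give a flavor of how a reuse-distance-based model of this kind can predict miss ratios, the sketch below estimates a thread's miss ratio from a reuse-distance histogram measured in CAT time. It assumes a simplified fully associative cache with random replacement, where a line survives each replacement event with probability (L − 1)/L for a cache of L lines; the histogram values and cache size are illustrative toy inputs, not the paper's equations or data.

```python
# Minimal sketch: estimate a thread's miss ratio from a reuse-distance
# histogram measured in CAT (Cache Allocation Ticks). Simplifying
# assumption: fully associative cache, random replacement, so a resident
# line survives one replacement event with probability (L - 1) / L.

def miss_ratio(reuse_hist, num_lines):
    """reuse_hist maps reuse distance (in CAT ticks) -> access count."""
    total = sum(reuse_hist.values())
    survive = (num_lines - 1) / num_lines
    # An access with reuse distance d misses if the line was evicted
    # during the d intervening replacements: P(miss) = 1 - survive**d.
    expected_misses = sum(count * (1.0 - survive ** dist)
                          for dist, count in reuse_hist.items())
    return expected_misses / total

# Toy histogram: mostly short reuse distances, a tail of long ones.
hist = {1: 500, 10: 300, 100: 150, 1000: 50}
print(round(miss_ratio(hist, num_lines=256), 3))
```

Because CAT time advances only on replacements, the same histogram captures how a co-scheduled thread's allocations age this thread's lines, which is what makes the model natural for shared caches.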
We model and evaluate a control mechanism based on Cache Decay, initially proposed for leakage reduction in uniprocessor caches [7]. The original Cache Decay uses cycle timers in each cache line to turn off power to the line after a period of inactivity (the "decay interval"). By tuning this decay interval, one can restrict the "active ratio" of an application (i.e., its "live" lines) to a small percentage of the cache without significantly impacting performance. Decay discards dead lines that are unlikely to be accessed in the future. Similarly, in a shared cache we use decay to control the active ratio of applications so we can enforce high-level policies, for example QoS policies [13], cache fairness policies [2,3,14], or simply optimizations for performance [1,15].

However, our proposed mechanism differs from the original decay in important ways. First, a decayed cache line is simply made available for replacement rather than turned off to save leakage; hits on decayed lines are therefore still allowed, since the lines remain in the cache. Second, the decay interval is measured not in cycles but in CAT time. This gives CAT Decay some interesting properties that can be used in conjunction with our model to determine the number of decay-induced misses and the space that is released by
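The behavior described above can be illustrated with a small sketch of one cache set under CAT-based decay: lines whose CAT age exceeds the decay interval become preferred eviction victims, but remain valid and can still hit (a hit resets the line's timer). The class name, single-set scope, and LRU fallback below are illustrative simplifications, not the paper's implementation.

```python
# Minimal sketch of CAT-based decay for a single cache set. Decayed lines
# (CAT age > decay_interval) are preferred victims but are not invalidated,
# so hits on them are still possible.

class DecaySet:
    def __init__(self, ways, decay_interval):
        self.ways = ways
        self.decay_interval = decay_interval
        self.lines = {}   # tag -> CAT timestamp of last touch
        self.cat = 0      # CAT counter: advances on each allocation

    def access(self, tag):
        """Return True on a hit, False on a miss (with allocation)."""
        if tag in self.lines:            # hits on decayed lines are allowed
            self.lines[tag] = self.cat   # touching a line resets its timer
            return True
        if len(self.lines) >= self.ways:
            decayed = [t for t, ts in self.lines.items()
                       if self.cat - ts > self.decay_interval]
            # Prefer a decayed victim; otherwise fall back to LRU.
            victim = decayed[0] if decayed else min(self.lines,
                                                    key=self.lines.get)
            del self.lines[victim]
        self.cat += 1                    # an allocation advances CAT time
        self.lines[tag] = self.cat
        return False
```

Shortening the decay interval for one thread makes more of its lines "available for replacement," releasing space to its co-runners without the rigidity of a hard partition.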