A Dynamic Pressure-Aware Associative Placement Strategy for Large Scale Chip Multiprocessors

Mohammad Hammoud, Sangyeun Cho, and Rami G. Melhem
Department of Computer Science, University of Pittsburgh
{mhh,cho,melhem}@cs.pitt.edu

Abstract—This paper describes dynamic pressure-aware associative placement (DPAP), a novel distributed cache management scheme for large-scale chip multiprocessors. Our work is motivated by the large non-uniform distribution of memory accesses across cache sets in different L2 banks. DPAP decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Temporal pressure at the on-chip last-level cache is continuously collected at a group (comprised of local cache sets) granularity and periodically recorded at the memory controller(s) to guide the placement process. An incoming block is consequently placed at a cache group that exhibits the minimum pressure. Simulation results using a full-system simulator demonstrate that DPAP outperforms the baseline shared NUCA scheme by an average of 8.3%, and by as much as 18.9%, for the benchmark programs we examined. Furthermore, our evaluations show that DPAP outperforms related cache designs.

Index Terms—Chip Multiprocessors, Associative Placement, Pressure-Aware Placement, Aggregate Cache Sets, Local Cache Sets.

1 INTRODUCTION AND MOTIVATION

As large uniprocessors are no longer scaling in performance, chip multiprocessors (CMPs) have become the trend in computer architecture. CMPs can easily spread multiple threads of execution across various cores. Moreover, CMPs scale across generations of silicon process simply by stamping down copies of the hard-to-design cores on successive chip generations [9]. One of the key challenges to obtaining high performance from CMPs is organizing and managing the limited on-chip cache resources (typically the L2 cache) shared among co-scheduled threads/processes.
Tiled chip multiprocessor (CMP) architectures have recently been advocated as a scalable processor design approach [4], [14]. They replicate identical building blocks (tiles) and connect them with a switched network-on-chip (NoC) [14]. A tile typically incorporates private L1 caches and an L2 cache bank. L2 cache banks are accordingly physically distributed over the processor chip. A conventional practice, referred to as the shared scheme, logically shares these physically distributed cache banks. On-chip access latencies differ depending on the distances between requester cores and target banks, creating a Non-Uniform Cache Architecture (NUCA) [7]. Recent research on CMP cache management has recognized the importance of the NUCA shared design [3], [4]. Moreover, many of today's multi-core processors, including the Intel Core 2 Duo processor family [12], Sun Niagara [8], and IBM Power5 [16], feature shared caches. A shared organization, however, suffers from an interference problem: an ill-behaved application can evict useful L2 cache content belonging to other co-scheduled programs. As such, a program that exhibits temporal locality can experience frequent cache misses caused by interference. We observe that 69.5% of misses on a 16-way tiled shared CMP platform are inter-processor misses (a line having been replaced at an earlier time by a different processor).1 We primarily attribute the destructive interference problem to the root of CMP cache management: the cache placement algorithm.

Manuscript submitted: 18-Apr-2010. Manuscript accepted: 14-May-2010. Final manuscript received: 18-May-2010. This work was supported in part by NSF grant CCF-0952273.

1. Section 4.1 describes the adopted CMP platform, the experimental parameters, and the benchmark programs we examined.
Fig. 1. Number of misses per 1 million instructions (MPMI) experienced by two local cache sets (the ones that experience the maximum and the minimum misses) at different aggregate sets for two benchmarks, Swaptions and MIX2.

Fig. 1 shows the number of misses per 1 million instructions experienced by cache sets across L2 cache banks (or aggregate sets) for two benchmarks, Swaptions and MIX2 (see Section 4.1 for experimental details). We define an aggregate set with index i as the union of the sets with index i across the L2 cache banks. More formally, aggregate set i = ∪_{k=1}^{n} set_{k,i}, where set_{k,i} is the set with index i at bank k. We refer to each set_{k,i} as a local set. We assume a 16-way tiled CMP platform with physically distributed, logically shared L2 banks. For each aggregate set, we show results only for the two local sets that exhibit the maximum and the minimum misses, in addition to the average misses. Clearly, memory accesses across aggregate sets are asymmetric. A placement strategy aware of the current pressures at the banks can reduce the workload imbalance among aggregate sets by avoiding the placement of an incoming cache block at an excessively pressured local set. This can potentially minimize interference misses and maximize system performance. Traditionally, cache blocks are stored at cache locations solely based on their physical addresses, which makes the placement process unaware of the disparity in the hotness of the shared cache sets. In this work, we explain the importance of incorporating pressure-aware associative placement strategies to improve CMP system performance.
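The pressure-aware placement idea above can be sketched in a few lines of Python. The sketch is purely illustrative (the function name, data layout, and pressure values below are assumptions, not the paper's design): given per-local-set pressure counters, the placement logic selects, for an incoming block mapping to set index i, the bank whose local set i currently shows the minimum pressure within aggregate set i.

```python
# Illustrative sketch (not the paper's implementation): choosing the least
# pressured local set within an aggregate set. Each aggregate set i is the
# union of the local sets with index i across the n L2 banks.

def least_pressured_bank(pressure, set_index):
    """Return the bank whose local set `set_index` currently records the
    minimum pressure (e.g., a recent miss count)."""
    n_banks = len(pressure)
    return min(range(n_banks), key=lambda k: pressure[k][set_index])

# pressure[k][i] = misses recently observed at local set i of bank k
# (made-up numbers for illustration).
pressure = [
    [120, 3],   # bank 0
    [10, 90],   # bank 1
    [400, 7],   # bank 2
]

# A block mapping to set index 0 is steered to bank 1 (pressure 10),
# away from the heavily pressured local sets at banks 0 and 2.
print(least_pressured_bank(pressure, 0))  # -> 1
print(least_pressured_bank(pressure, 1))  # -> 0
```

Because placement now depends on runtime pressure rather than on the physical address alone, hot aggregate sets spread their load across banks instead of repeatedly evicting from a single overloaded local set.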
We propose dynamic pressure-aware associative placement (DPAP), a novel mechanism built on a low-hardware-overhead framework that monitors the L2 cache banks at a group (comprised of local cache sets) granularity and records pressure information in an array embedded within the memory controller. The collected pressure

Posted to the IEEE & CSDL on 5/25/2010. DOI 10.1109/L-CA.2010.7. 1556-6056/10/$26.00 © 2010 Published by the IEEE Computer Society.
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 9, NO. 1, JANUARY-JUNE 2010
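As a rough illustration of the monitoring framework just described (group-granularity pressure counters, periodically recorded into a controller-side array that guides placement), the following sketch accumulates pressure per group of local sets and snapshots the counters to a structure standing in for the memory controller's array. The class name, epoch policy, and all parameters are assumptions made for illustration, not the paper's exact hardware design.

```python
# Hypothetical sketch of DPAP-style bookkeeping (names and the epoch
# policy are illustrative assumptions, not the paper's exact design).

class PressureMonitor:
    def __init__(self, n_banks, n_sets, sets_per_group, epoch=1000):
        self.sets_per_group = sets_per_group
        self.n_groups = n_sets // sets_per_group
        # Per-bank, per-group pressure counters accumulated this epoch.
        self.counters = [[0] * self.n_groups for _ in range(n_banks)]
        # Snapshot array modeling storage at the memory controller.
        self.controller_array = [[0] * self.n_groups for _ in range(n_banks)]
        self.epoch = epoch
        self.events = 0

    def record_miss(self, bank, set_index):
        """Bump the pressure of the group containing this local set and,
        at epoch boundaries, record the counters at the controller."""
        self.counters[bank][set_index // self.sets_per_group] += 1
        self.events += 1
        if self.events % self.epoch == 0:
            self.controller_array = [row[:] for row in self.counters]

    def place(self, set_index):
        """Choose the bank whose group (for this set index) shows the
        minimum pressure recorded at the memory controller."""
        g = set_index // self.sets_per_group
        return min(range(len(self.controller_array)),
                   key=lambda k: self.controller_array[k][g])

# Usage: 2 banks, 8 sets each, groups of 4 sets, snapshot every 2 events.
m = PressureMonitor(n_banks=2, n_sets=8, sets_per_group=4, epoch=2)
m.record_miss(0, 1)
m.record_miss(0, 2)   # epoch boundary: counters recorded at controller
print(m.place(3))     # group 0 is hot at bank 0, so bank 1 is chosen -> 1
```

A real implementation would presumably use small saturating hardware counters that are reset or decayed each epoch; the list-copy snapshot above merely stands in for the periodic recording at the memory controller that the text describes.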