A Dynamic Pressure-Aware Associative Placement
Strategy for Large Scale Chip Multiprocessors
Mohammad Hammoud, Sangyeun Cho, and Rami G. Melhem
Department of Computer Science, University of Pittsburgh
{mhh,cho,melhem}@cs.pitt.edu
Abstract—This paper describes dynamic pressure-aware associative placement (DPAP), a novel distributed cache management
scheme for large-scale chip multiprocessors. Our work is motivated by the large non-uniform distribution of memory accesses across
cache sets in different L2 banks. DPAP decouples the physical locations of cache blocks from their addresses in order to reduce
misses caused by destructive interference. Temporal pressure at the on-chip last-level cache is continuously collected at a group
(comprised of local cache sets) granularity and periodically recorded at the memory controller(s) to guide the placement process. An
incoming block is consequently placed at the cache group that exhibits the minimum pressure. Simulation results using a full-system
simulator demonstrate that DPAP outperforms the baseline shared NUCA scheme by an average of 8.3% and by as much as 18.9%
for the benchmark programs we examined. Furthermore, our evaluations show that DPAP outperforms related cache designs.
Index Terms—Chip Multiprocessors, Associative Placement, Pressure-Aware Placement, Aggregate Cache Sets, Local Cache Sets.
1 INTRODUCTION AND MOTIVATION
As large uniprocessors are no longer scaling in performance, chip
multiprocessors (CMPs) have become the trend in computer
architecture. CMPs can easily spread multiple threads of execution
across various cores. Moreover, CMPs scale across generations of
silicon process simply by stamping down copies of the hard-to-design
cores on successive chip generations [9]. One of the key challenges to
obtaining high performance from CMPs is organizing and managing
the limited on-chip cache resources (typically the L2 cache) shared
among co-scheduled threads/processes.
Tiled chip multiprocessor (CMP) architectures have recently been
advocated as a scalable processor design approach [4], [14]. They
replicate identical building blocks (tiles) and connect them with a
switched network on-chip (NoC) [14]. A tile typically incorporates
private L1 caches and an L2 cache bank. L2 cache banks are accord-
ingly physically distributed over the processor chip. A conventional
practice, referred to as the shared scheme, logically shares these
physically distributed cache banks. On-chip access latencies differ
depending on the distances between requester cores and target banks,
creating a Non-Uniform Cache Architecture (NUCA) [7].
Recent research work on CMP cache management has recognized
the importance of the NUCA shared design [3], [4]. Moreover, many
of today's multi-core processors, including the Intel Core 2 Duo processor
family [12], Sun Niagara [8], and IBM Power5 [16], feature
shared caches. A shared organization, however, suffers from an
interference problem. A badly behaving application can evict
useful L2 cache content belonging to other co-scheduled programs.
As such, a program that exhibits temporal locality can experience
frequent cache misses caused by interference. We observe that 69.5%
of misses on a 16-way tiled shared CMP platform are inter-processor
(a line being replaced at an earlier time by a different processor).¹
We primarily attribute the destructive interference problem to the
root of CMP cache management: the cache placement algorithm.
Manuscript submitted: 18-Apr-2010. Manuscript accepted: 14-May-2010.
Final manuscript received: 18-May-2010.
This work was supported in part by NSF grant CCF-0952273.
1. Section 4.1 describes the adopted CMP platform, the experimental
parameters, and the benchmark programs we examined.
Fig. 1. Number of misses per 1 million instructions (MPMI) experienced by
two local cache sets (the ones that experience the max and the min misses)
at different aggregate sets for two benchmarks, Swaptions and MIX2.
Fig. 1 shows the number of misses per 1 million instructions
experienced by cache sets across L2 cache banks (or aggregate
sets) for two benchmarks, Swaptions and MIX2 (see Section 4.1 for
experimental details). We define an aggregate set with index i as the
union of sets with index i across L2 cache banks. More formally,
aggregate set i = ⋃_{k=1}^{n} set_{ki}, where set_{ki} is the set with
index i at bank k and n is the number of banks. We refer to each
set_{ki} as a local set. We assume a 16-way tiled CMP platform with
physically distributed, logically shared
L2 banks. We only show results for two local sets that exhibit the
maximum and the minimum misses, in addition to the average misses,
for each aggregate set. Clearly, memory accesses across aggregate
sets are asymmetric. A placement strategy aware of the current
pressures at banks can reduce the workload imbalance among
aggregate sets by avoiding placing an incoming cache block in an
excessively pressured local set. This can potentially minimize
interference misses and maximize system performance.
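To make the idea concrete, the following is a minimal sketch (all names, sizes, and the pressure-update rule are illustrative assumptions, not the paper's actual hardware mechanism) of minimum-pressure placement within an aggregate set:

```python
# Hypothetical sketch of pressure-aware placement across aggregate sets.
# Sizes and the pressure metric are illustrative assumptions only.
NUM_BANKS = 16        # L2 banks (one per tile)
SETS_PER_BANK = 512   # local sets per bank

# pressure[i][k] approximates the pressure observed at local set i of
# bank k; aggregate set i is the union of local sets with index i
# across all banks.
pressure = [[0] * NUM_BANKS for _ in range(SETS_PER_BANK)]

def place_block(set_index):
    """Within aggregate set `set_index`, pick the bank whose local set
    currently exhibits the minimum pressure."""
    banks = pressure[set_index]
    target_bank = min(range(NUM_BANKS), key=banks.__getitem__)
    banks[target_bank] += 1  # placement raises that local set's pressure
    return target_bank
```

In DPAP itself the pressure values are temporal counts gathered at group granularity and recorded at the memory controller(s); the sketch merely illustrates how a minimum-pressure lookup steers an incoming block away from hot local sets.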
Traditionally, cache blocks are stored at cache locations solely
based on their physical addresses. This makes the placement process
unaware of the disparity in the hotness of the shared cache sets.
In this work, we explain the importance of incorporating pressure-
aware associative placement strategies to improve CMP system per-
formance. We propose dynamic pressure-aware associative placement
(DPAP), a novel mechanism that employs a low-overhead hardware
framework to monitor the L2 cache banks at a group (comprised
of local cache sets) granularity and record pressure information at an
array embedded within the memory controller. The collected pressure
Posted to the IEEE & CSDL on 5/25/2010
DOI 10.1109/L-CA.2010.7 1556-6056/10/$26.00 © 2010 Published by the IEEE Computer Society
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 9, NO. 1, JANUARY-JUNE 2010 29