A Building Block for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

Jason Zebchuk and Andreas Moshovos
Electrical and Computer Engineering
University of Toronto

Abstract

Current on-chip block-centric memory hierarchies exploit access patterns at the fine-grain scale of small blocks. Several recently proposed memory hierarchy enhancements for coherence traffic reduction and prefetching suggest that additional useful patterns emerge with a macroscopic, coarse-grain view. This paper presents RegionTracker, a dual-grain, on-chip cache design that exposes coarse-grain behavior while maintaining block-level communication. RegionTracker eliminates the extraneous, often imprecise coarse-grain tracking structures of previous proposals. It can be used as the building block for coarse-grain optimizations, reducing their overall cost and easing their adoption. Using full-system simulation of a quad-core chip multiprocessor and commercial workloads, we demonstrate that RegionTracker overcomes the inefficiencies of previous coarse-grain cache designs. We also demonstrate how RegionTracker boosts the benefits and reduces the cost of a previously proposed snoop reduction technique.

I. INTRODUCTION

Future on-chip caches will most likely grow to several tens of megabytes to compensate for limited off-chip bandwidth and large application footprints, and to meet the demands of multiprocessing and multithreading. This unprecedented on-chip storage offers unique opportunities for new improvements beyond conventional cache designs.

Our thesis is that coarse-grain tracking and management, i.e., tracking information about multiple blocks belonging to coarser memory regions and managing the corresponding blocks, becomes increasingly appealing as caches grow larger. Our motivation is that a macroscopic view of access behavior reveals useful patterns that are hard to discern in existing fine-grain cache hierarchies. Several works corroborate this observation.
Region information has been shown to facilitate: (1) performance, bandwidth and power improvements for snoop-coherent shared memory multiprocessors [1,9], and (2) prefetching for applications with demanding memory footprints [2,13]. These techniques rely on two types of information: (1) whether any block in a region is cached [1,9], and (2) which specific blocks of a region are cached [2,13]. Since region information is not readily available in existing caches, previous work relied on separate structures to track and manage it. These structures are imprecise [9], incomplete [13], or restrict data placement [1,2]. Moreover, they can require as much as 60% additional area relative to a conventional tag array [2]. Commercial designs are more likely to incorporate these optimizations if they have lower complexity and area cost.

We observe that these optimizations require much of the same, or similar, functionality. Accordingly, we present RegionTracker (RT), a building block for such optimizations that reduces the overhead and eliminates the imprecision of these extraneous tracking structures. As an example of RT’s potential, we demonstrate that it improves the performance and reduces the cost of a snoop-reduction method.

RT introduces region-level functionality without compromising performance, power or area compared to a conventional cache. A single lookup in RT is sufficient to determine which blocks of a region, if any, are cached and where. Moreover, region-level management operations such as region invalidation, migration, and replacement are naturally supported, although RT still uses fine-grain block communication to avoid a bandwidth explosion. Accordingly, RT is a dual-grain cache design. The RT design methodology starts with a conventional cache and replaces the tag array with a slightly smaller structure that facilitates region-level lookups and management without changing overall performance.
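To make the two kinds of region information concrete, the following is a minimal sketch of a region-indexed tag store in which one lookup reports both whether any block of a region is cached and which blocks are cached and where. It is an illustration only, not the RT structure itself: the names `RegionTagSketch`, `RegionEntry` and `split_address`, the 64-byte block size, the 16-block region size, and the use of an unbounded dictionary (no sets, associativity or replacement) are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

BLOCK_BYTES = 64         # assumption: 64-byte cache blocks
BLOCKS_PER_REGION = 16   # assumption: 16 blocks, i.e. 1 KB regions


def split_address(addr: int) -> Tuple[int, int]:
    """Return (region number, block index within that region)."""
    block = addr // BLOCK_BYTES
    return block // BLOCKS_PER_REGION, block % BLOCKS_PER_REGION


@dataclass
class RegionEntry:
    # One slot per block: None if the block is not cached, otherwise the
    # (hypothetical) data-array way that holds it.
    ways: List[Optional[int]] = field(
        default_factory=lambda: [None] * BLOCKS_PER_REGION)

    def cached_blocks(self) -> List[int]:
        """Indices of the blocks of this region that are currently cached."""
        return [i for i, way in enumerate(self.ways) if way is not None]


class RegionTagSketch:
    """Toy region-indexed tag store; unbounded, no sets or replacement."""

    def __init__(self) -> None:
        self.entries: Dict[int, RegionEntry] = {}

    def fill(self, addr: int, way: int) -> None:
        """Record that the block containing addr now lives in the given way."""
        region, index = split_address(addr)
        self.entries.setdefault(region, RegionEntry()).ways[index] = way

    def lookup_region(self, addr: int) -> Optional[RegionEntry]:
        """Single lookup: None means no block of the region is cached."""
        region, _ = split_address(addr)
        return self.entries.get(region)

    def lookup_block(self, addr: int) -> Optional[int]:
        """Block-level lookup built on top of the region lookup."""
        entry = self.lookup_region(addr)
        if entry is None:
            return None
        return entry.ways[split_address(addr)[1]]

    def invalidate_region(self, addr: int) -> List[int]:
        """Drop the whole region; return the block indices that were cached."""
        region, _ = split_address(addr)
        entry = self.entries.pop(region, None)
        return entry.cached_blocks() if entry is not None else []
```

In this sketch, `lookup_region` returning `None` answers the first question (is any block of the region cached?), while `cached_blocks()` and the per-block `ways` slots answer the second (which blocks, and where). Region invalidation removes a single entry, yet data would still move one block at a time.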
RT builds upon previous sector cache designs, improving their performance and adding functionality. Compared to previous designs, RT offers simple block and region lookups and replacements, without requiring higher associativity or hurting performance, latency, complexity or area. Sections II and III explain the key challenges in building a dual-grain cache, how RT meets these challenges, and how it compares to previous designs.

The key contributions of this work are twofold. First, it articulates why incorporating region information and management in the on-chip hierarchy should be a priority in modern designs. Second, it presents a framework for incorporating region-level optimizations in the on-chip hierarchy “for free”, that is, without requiring more area or hurting performance or power.

Manuscript submitted: 08-May-2007. Manuscript accepted: 31-May-2007. Final manuscript received: 07-June-2007.

II. REQUIREMENTS

Our starting point is a conventional cache whose performance and area have been tuned appropriately. Our goal is to replace only the tag array so that we can inspect and manipulate regions comprising several blocks without impacting performance. Ideally, compared to a