Improving Data Cache Performance using Persistence Selective Caching

Sumeet S. Kumar, Rene van Leuken
Circuits and Systems Group, Faculty of EEMCS, Delft University of Technology, The Netherlands
{s.s.kumar, t.g.r.m.vanleuken}@tudelft.nl

Abstract—This paper presents Persistence Selective Caching (PSC), a selective caching scheme that tracks the reusability of L1 data cache (L1D) lines at runtime, and moves lines with sufficient potential for reuse to a low-latency, low-energy assist cache from where subsequent references to them are serviced. The selectivity of PSC is configurable, and can be adjusted to suit the varying memory access characteristics of different applications, unlike existing schemes. By effectively identifying reusable cache lines and storing them in the assist, PSC reduces average memory access time by up to 59% as compared to competing schemes and conventional data caches. Furthermore, by ensuring that only reusable lines are cached by the assist, PSC reduces cache line movements, and thus decreases average energy per access by up to 75% over other assists.

Keywords—cache memory, memory architecture, memory management, microprocessors

I. INTRODUCTION

The limited size of processor caches in comparison to the data sets of modern applications leads to expensive misses, necessitating high-latency, energy-consuming accesses to lower levels of the memory hierarchy. Although large set-associative caches appreciably reduce miss rates, their size causes them to have a higher hit latency, and to consume more energy per access, than smaller direct-mapped caches. This paper presents the Persistence Selective Caching (PSC) scheme, which reduces average memory access time (AMAT) through the selective caching of reusable lines in a small, fully-associative assist cache. The reuse potential of a line is estimated at runtime based on its access persistence, i.e.
the number of accesses to the line within a certain window of data references by the processor. Lines with sufficient access persistence are moved from the L1 data cache (L1D) into the assist cache, from where subsequent references to them are serviced. Due to the assist's small size, these references incur only a short access latency, and consume considerably less energy than an L1D access. PSC's selectivity ensures that only the most reusable lines are moved to the assist, leading to a significant reduction in the number of cache line movements (swaps), and thus a lower energy per access than competing schemes. The significant contributions of this paper are:

- A configurable scheme that tracks the access persistence of cache lines at runtime, and selectively caches those with sufficient persistence in a low-latency, low-energy assist cache. The selectivity of PSC is configurable, and allows the scheme to be adjusted to suit the varying memory access characteristics of different applications, unlike existing schemes.
- An illustration of the performance and energy benefits of selective assist caching. PSC reduces AMAT by up to 59% and average energy per access by up to 75% as compared to conventional data caches and competing assists [1][2].

This research was supported in part by the CATRENE programme under the Computing Fabric for High Performance Applications (COBRA) project CA104.

This paper is organized as follows: In Section II, we review the state of the art in cache assists, and outline the motivation for PSC. In Section III, we describe the architecture and algorithms of PSC, and in Section IV, we evaluate its effectiveness in reducing AMAT and energy per access.

II. RELATED WORK

A number of studies have, in the past, used small memory buffers to augment the capacity of the main L1D, and thus improve performance and energy consumption. In this paper, such memory structures are referred to as assist caches.
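The access-persistence notion introduced in Section I can be illustrated with a minimal sketch: count accesses to each line address within a sliding window of recent data references, and flag a line for promotion to the assist once its count reaches a threshold. All names and parameter values here (window size, threshold) are our own illustrative assumptions, not the paper's implementation.

```python
from collections import deque, defaultdict

class PersistenceTracker:
    """Sketch of runtime access-persistence tracking (hypothetical names).

    Counts accesses to each L1D line address within a sliding window of
    the last `window` data references; a line whose count reaches
    `threshold` is flagged for promotion to the assist cache.
    """

    def __init__(self, window=64, threshold=3):
        self.window = window        # size of the reference window
        self.threshold = threshold  # configurable selectivity knob
        self.refs = deque()         # recent line addresses, oldest first
        self.counts = defaultdict(int)

    def access(self, line_addr):
        # Record the new reference and age out the oldest one.
        self.refs.append(line_addr)
        self.counts[line_addr] += 1
        if len(self.refs) > self.window:
            old = self.refs.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        # Promote once the line shows sufficient persistence.
        return self.counts[line_addr] >= self.threshold

# Example: with threshold 3, only the thrice-referenced line 0x40
# is flagged for promotion.
t = PersistenceTracker(window=8, threshold=3)
trace = [0x40, 0x80, 0x40, 0xC0, 0x40, 0x80]
promoted = [addr for addr in trace if t.access(addr)]
```

Raising the threshold or shrinking the window makes the scheme more selective, which is one plausible way the configurable selectivity described above could be exposed.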
The Filter Cache [3], for instance, is an assist that reduces the energy consumption of cache memory accesses by placing a very small memory buffer between the processor and the L1D. However, these energy savings are obtained at the cost of increased access latencies, and thus a higher average memory access time (AMAT). The Victim Cache (VC) [4], on the other hand, aims to decrease AMAT by reducing the cost of conflict misses. The VC stores the victims of L1D evictions so that, in the event of future references to them, the lines can be returned to the L1D in a single cycle rather than through a long-latency, energy-consuming cache miss. On a VC hit, the requested line is moved to the L1D, and the corresponding entry from the L1D is evicted to the VC. This swap operation constitutes an energy overhead, and is a significant disadvantage of the victim cache. Stiliadis et al. overcame this disadvantage with their proposal, Selective Victim Caching (SVC) [1]. In SVC, the swap operation is prevented from occurring if the incumbent L1D cache line is found to be more reusable than the requested VC line. SVC considerably reduces the number of swaps as compared to a conventional victim cache with the same miss rate and latency improvements. However, these proposals consider the L1D as the primary target for data references by the processor, and the assist as an auxiliary cache. A majority of references are thus serviced by the larger L1D cache, and consequently, the relatively shorter latency and energy per access of the assist cache remain underexploited.

978-1-4799-3432-4/14/$31.00 ©2014 IEEE
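The difference between the conventional VC and SVC hit paths described above can be sketched as follows. The interface and reuse scores are our own illustrative assumptions; the paper's cited schemes use hardware reuse predictors rather than explicit scores.

```python
def service_vc_hit(l1d_line, vc_line, l1d_reuse, vc_reuse, selective):
    """Sketch (hypothetical interface) of victim-cache hit handling.

    Conventional VC: always swap the requested VC line with the
    conflicting L1D line, paying the swap's energy overhead.
    SVC: suppress the swap when the incumbent L1D line is predicted
    to be more reusable than the requested VC line.
    Returns the resulting (l1d_line, vc_line, swapped) state.
    """
    swap = True
    if selective and l1d_reuse > vc_reuse:
        swap = False  # SVC: serve the hit without displacing the L1D line
    if swap:
        l1d_line, vc_line = vc_line, l1d_line
    return l1d_line, vc_line, swap

# Conventional VC swaps unconditionally; SVC keeps the more
# reusable incumbent in place and avoids the swap energy.
always = service_vc_hit("A", "B", 5, 1, selective=False)
svc    = service_vc_hit("A", "B", 5, 1, selective=True)
```

PSC, as described in Section I, goes a step further: rather than treating the assist as an auxiliary store reached only on L1D conflict, it promotes persistent lines so that subsequent hits are serviced directly from the low-latency assist.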