MRU-Tour-based Replacement Algorithms for Last-Level Caches

Alejandro Valero, Julio Sahuquillo, Salvador Petit, Pedro López, and José Duato
Department of Computer Engineering
Universitat Politècnica de València
Valencia, Spain
alvabre@gap.upv.es, {jsahuqui, spetit, plopez, jduato}@disca.upv.es

Abstract

Memory hierarchy design is a major concern in current microprocessors. Much research focuses on the Last-Level Cache (LLC), which is designed to hide the long miss penalty of accessing main memory. To reduce both capacity and conflict misses, LLCs are implemented as large memory structures with high associativities.

To exploit temporal locality, LRU is the replacement algorithm usually implemented in caches. However, for a highly associative cache, its implementation is costly in terms of area and power consumption. Moreover, LRU is not well suited for the LLC because, as this cache level does not see all memory accesses, it cannot fully exploit temporal locality. In addition, blocks must descend to the LRU position of the stack before eviction, even when they are no longer useful.

In this paper, we show that most blocks are not referenced again once they leave the MRU position. Moreover, the probability of a block being referenced again does not depend on its location in the LRU stack. Based on these observations, we define the number of MRU-Tours (MRUTs) of a block as the number of times the block occupies the MRU position while it is stored in the cache, and propose the MRUT replacement algorithm, which selects the block to be replaced among the blocks that show only one MRUT. Variations of this algorithm have also been proposed to exploit both MRUT behavior and recency information.

Experimental results show that, compared to LRU, the proposal reduces the MPKI by up to 22%, while IPC is improved by 48%.

1 Introduction

Computer architects have implemented cache memories [?]
since the late 1960s to mitigate the huge gap between processor and main memory speed. This problem was originally solved by using a single cache, but as the memory gap continued growing, several cache levels became necessary for performance. The first level (L1 cache) is the closest to the processor and is designed for speed, while the second or the third level (if any) is referred to as the LLC (Last-Level Cache) and is designed to hide as much as possible the long miss penalty of accessing main memory, which involves several hundreds of processor cycles in current microprocessors.

The system performance strongly depends on the cache hierarchy performance. Thus, much research has been done to improve cache performance, although usually focusing on a given level of the cache hierarchy (i.e., L1, L2, or L3). Techniques like load bypassing, way prediction, or prefetching have been widely investigated and implemented in many commercial products. Although these techniques have been successfully implemented in typical monolithic processors, the pressure on the memory controller is much higher in multicore and manycore systems than in monolithic processors. Therefore, the performance of the cache hierarchy in general, and the performance of the LLC in particular, is a major design concern in current microprocessors.

LLCs are designed as very large structures, with sizes ranging from several hundred KB up to several tens of MB [?][?], in order to keep as much information as possible and thus reduce capacity misses. Moreover, this storage capacity is expected to grow as transistor features continue shrinking in future technology generations. In addition, to keep the number of conflict misses low, current LLCs implement a large number of ways (e.g., 16 ways).

Typically, caches exploit temporal locality by implementing the Least Recently Used (LRU) replacement algorithm.
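As a rough illustration of the LRU policy (a minimal sketch, not the paper's implementation; the LRUSet class, its interface, and the block names are invented for this example), the recency stack of a single cache set can be modeled as follows:

```python
# Minimal sketch of LRU replacement for one cache set.
# Index 0 of the stack is the MRU position; the last index is the LRU position.

class LRUSet:
    def __init__(self, ways):
        self.ways = ways   # associativity of the set (e.g., 16 in current LLCs)
        self.stack = []    # ordered from MRU (front) to LRU (back)

    def access(self, block):
        """Access a block; return the evicted block on a miss in a full set, else None."""
        if block in self.stack:            # hit: promote the block to the MRU position
            self.stack.remove(block)
            self.stack.insert(0, block)
            return None
        evicted = None
        if len(self.stack) == self.ways:   # miss in a full set: evict the LRU block
            evicted = self.stack.pop()
        self.stack.insert(0, block)        # the incoming block enters at the MRU position
        return evicted


# Example with a 2-way set: after filling the set with A and B,
# a hit on A promotes it to MRU, so a miss on C evicts B.
s = LRUSet(ways=2)
s.access('A')
s.access('B')
s.access('A')             # hit: A returns to the MRU position
print(s.access('C'))      # miss: the LRU block (B) is evicted
```

This sketch also makes the implementation cost visible: a true LRU stack must reorder its entries on every access, which is what becomes expensive at high associativities.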
This algorithm acts as a stack that places the Most Recently Used (MRU) block on the top and the LRU block on the bottom, which is the block evicted when space is required. Although this algorithm works well in L1 caches with a low number of ways, with high associativities, like the 8 and 16 ways found in current LLCs, traditional LRU is too expensive to implement. Therefore, approximations to LRU are the norm, but their performance starts to deviate from that of traditional LRU [?]. On the other