Lookahead Prefetching with Signature Path

Jinchun Kim, Paul V. Gratz and A. L. Narasimha Reddy
Electrical and Computer Engineering, Texas A&M University
cienlux@tamu.edu, pgratz@gratz1.com, reddy@tamu.edu

Abstract—Existing data prefetchers speculate on spatial and temporal locality by tracking past memory accesses. Relying on past accesses, however, restricts the scope of prefetching and limits further performance improvement. In this paper, we propose a lookahead prefetching algorithm called Signature Path Prefetching (SPP) that accurately predicts the next memory access pattern and exploits this prediction to initiate lookahead prefetching. Unlike prior lookahead algorithms, SPP is based purely on the memory access stream and does not require additional support from branch history, the PC, or metadata to look ahead at future memory accesses. Within a 32KB storage limit, we evaluate SPP under different memory-constrained scenarios and find that SPP outperforms AMPM, the winner of the previous prefetching competition, by 4%.

I. INTRODUCTION

Data prefetching can provide an efficient means to improve the performance of modern microprocessors. The aim of the technique is to proactively fetch useful data blocks from long-latency off-chip DRAM into the faster on-chip SRAM caches ahead of demand accesses. Typically, prefetching techniques predict the future access pattern based on past memory accesses; that is, prefetching hardware speculates on spatial and temporal locality learned from past program behavior. In these traditional techniques, prediction is inherently limited by the number of past access patterns that have been monitored. Moreover, prefetching must be highly accurate: even a prefetcher that could hold a limitless amount of information would pollute the cache if the prefetched data is unused or untimely.
Therefore, it is highly desirable to develop a prefetching algorithm that can predict many future accesses with high accuracy. To address both prefetching scope and accuracy, prior works adopted lookahead mechanisms in data prefetching [2], [6], [7]. Previous studies, however, suffer from high hardware complexity and require additional support from the core pipeline. For example, B-Fetch requires branch history and a copy of the architectural register file to perform lookahead prefetching [2]. Although this extra information shows potential to improve performance, exporting it down to the lower-level caches poses implementation challenges, as it is not typically required in low-level caches [3].

In this work, we propose a simple but powerful lookahead prefetching algorithm called Signature Path Prefetching (SPP) that aggressively speculates beyond the current demand memory access and traverses down the future memory access pattern that is likely to be used. Unlike prior lookahead prefetchers [2], [6], [7], SPP does not require additional support from branch information, the PC, or cache metadata; it is based purely on the memory access stream. Since there are no hooks between the core pipeline and SPP, the prefetching engine can operate as a standalone module without the complexity of exporting core information. We evaluate SPP with 16 different SPEC CPU 2006 benchmarks and achieve a 26.1% performance improvement compared to a processor without prefetching. Moreover, SPP outperforms the AMPM prefetcher [1], the winner of the previous data prefetching competition, by 4% on average.

II. DESIGN

The high-level design of the SPP engine is illustrated in Figure 1.

[Figure 1: Overall SPP architecture. A memory hierarchy of L1 (16KB, LRU), L2 (128KB, LRU), and L3 (1MB, LRU) caches backed by off-chip DRAM, alongside the SPP module containing the Signature Table (ST), Pattern Table (PT), and Prefetch Engine (PE). SPP is trained by L2 cache accesses, issues prefetches, and updates a filter.]
The SPP module is a three-stage pipelined structure consisting of a Signature Table (ST) stage, a Pattern Table (PT) stage, and a Prefetch Engine (PE). SPP is trained by L2 cache accesses (L1 misses) and issues prefetch requests into the L2 read queue. The ST stage is indexed by physical page number (PPN) and stores the previously seen memory access pattern of each page as a compressed 12-bit signature. The PT stage is indexed by the history signature generated in the ST stage and stores future access stride patterns. The PT stage also estimates the probability that a given access stride pattern will yield a useful prefetch. If a stride in the PT is found to have sufficient probability (above a configured threshold), the pattern is passed to the PE for prefetch generation. As noted in Figure 1, prefetching puts additional pressure on both the caches and DRAM. Therefore, it is important to detect redundant prefetch requests and filter them out properly; to avoid such unnecessary requests, we implement a filter at the PE stage.
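The ST/PT/PE flow above can be sketched in software. The sketch below is illustrative only: the table organization (unbounded maps rather than set-associative tables), the shift-XOR signature hash, the confidence threshold of 0.5, and the lookahead depth of 8 are all assumptions for the example, not the paper's actual parameters.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <unordered_map>
#include <vector>

constexpr uint32_t SIG_MASK = 0xFFF;          // 12-bit compressed signature
constexpr double   PREFETCH_THRESHOLD = 0.5;  // assumed confidence cutoff
constexpr int      MAX_LOOKAHEAD = 8;         // assumed lookahead depth

// Signature Table entry: last block offset seen in the page, plus the
// compressed signature summarizing the page's access history.
struct STEntry {
    bool valid = false;
    uint32_t last_offset = 0;
    uint32_t signature = 0;
};

// Pattern Table entry: per-stride counters and a total, so the probability
// that a stride follows a given signature can be estimated.
struct PTEntry {
    std::unordered_map<int32_t, uint32_t> stride_count;
    uint32_t total = 0;
};

class SPP {
public:
    // Train on one L2 access (physical page number + block offset) and
    // return the block offsets SPP would prefetch within that page.
    // For simplicity, offsets are not wrapped at page boundaries and the
    // filter never releases entries on demand hits.
    std::vector<uint32_t> access(uint64_t ppn, uint32_t offset) {
        std::vector<uint32_t> prefetches;
        STEntry &st = signature_table_[ppn];
        if (st.valid) {
            int32_t stride = (int32_t)offset - (int32_t)st.last_offset;
            // Train the Pattern Table under the *old* signature, then
            // advance the signature by shifting in the new stride.
            PTEntry &pt = pattern_table_[st.signature];
            pt.stride_count[stride]++;
            pt.total++;
            st.signature = ((st.signature << 3) ^ (uint32_t)stride) & SIG_MASK;
        } else {
            st.signature = offset & SIG_MASK;
            st.valid = true;
        }
        st.last_offset = offset;

        // Lookahead: follow the most probable stride path, multiplying
        // per-step probabilities until confidence falls below threshold.
        uint32_t sig = st.signature;
        uint32_t cur = offset;
        double confidence = 1.0;
        for (int depth = 0; depth < MAX_LOOKAHEAD; ++depth) {
            auto it = pattern_table_.find(sig);
            if (it == pattern_table_.end() || it->second.total == 0) break;
            int32_t best_stride = 0;
            uint32_t best_count = 0;
            for (const auto &[s, c] : it->second.stride_count)
                if (c > best_count) { best_count = c; best_stride = s; }
            confidence *= (double)best_count / it->second.total;
            if (confidence < PREFETCH_THRESHOLD) break;
            cur = (uint32_t)((int32_t)cur + best_stride);
            sig = ((sig << 3) ^ (uint32_t)best_stride) & SIG_MASK;
            // PE-stage filter: drop redundant prefetch requests.
            if (filter_.insert((ppn << 12) | cur).second)
                prefetches.push_back(cur);
        }
        return prefetches;
    }

private:
    std::unordered_map<uint64_t, STEntry> signature_table_;
    std::unordered_map<uint32_t, PTEntry> pattern_table_;
    std::set<uint64_t> filter_;
};
```

With a steady stride-1 stream, the signature converges after a few accesses, the learned stride's probability reaches 1.0, and the lookahead walks the full depth ahead of the demand stream; the filter then suppresses all but the newly exposed block on each subsequent access.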