Pointer-Based Prefetching within the Impulse Adaptable Memory Controller: Initial Results

Lixin Zhang, Sally A. McKee, Wilson C. Hsieh, and John B. Carter
Department of Computer Science, University of Utah
{lizhang, sam, wilson, retrac}@cs.utah.edu
http://www.cs.utah.edu/impulse/

Abstract

Prefetching has long been used to mask the latency of memory loads. This paper presents results for an initial implementation of pointer-based prefetching within the Impulse adaptable memory controller. We conduct our experiments on a four-way issue superscalar machine. For the microbenchmarks we examine, we consistently realize about a 20% improvement in execution time for linked data structures accessed within short to medium loop iterations. This compares favorably to software prefetching when the data working set fits in cache, and exceeds the performance of the latter technique for large working sets. We also find that a superscalar, out-of-order processor hides the memory latency of linked data structures accessed in large loop iterations exceptionally well, which makes any pointer prefetching unnecessary.

1 Introduction

Prefetching has long been used to mask the latency of memory loads. This paper presents results for an initial implementation of pointer-based prefetching within the Impulse Adaptable Memory Controller system: whenever the memory controller sees a request for a node in a linked data structure, it prefetches all objects directly pointed to by that node. Our results show that under some circumstances, traditional software prefetching is preferable to memory-controller-based prefetching, but in many situations, the memory-controller-based scheme or a combination of the two approaches wins.

This effort was sponsored in part by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under agreement number F30602-98-1-0101 and DARPA Order Numbers F393/00-01 and F376/00. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either express or implied, of DARPA, AFRL, or the US Government.

Prefetching can be performed by software, hardware, or a combination of both. The purely software approach relies on a compiler to generate instructions to preload data [23, 20], or on an application writer to modify source code to achieve the desired behavior [3, 17, 25]. Hybrid approaches provide hardware support for such prefetch operations. For instance, they might augment the ISA with a prefetch instruction [10], redefine a load to a specific register (e.g., to register 0, as in the PA-RISC architectures [15]), or provide programmable prefetch engines [6] or programmable stream buffers [19].

Hardware-only prefetching [2, 9, 12, 14, 29] has the advantage of being transparent, and some commercial machines include such mechanisms [5, 7, 28]. However, due to its speculative nature, care must be taken to keep it from lowering application performance by increasing contention in the caches and wasting bus bandwidth on useless prefetches.

Most prefetching research in the literature focuses on fetching data structures with regular access patterns, such as streams or arrays. Some of these techniques require that stream patterns be detected dynamically, as in the vector prefetch units proposed by Baer and Chen [2], Fu and Patel [13], and Sklenar [29]. Cache-based approaches, such as the sequential hardware prefetching of Dahlgren et al. [9], eliminate the need for detecting strides dynamically. To minimize the number of unnecessary prefetches, the prefetch distance of these run-time techniques is generally limited to a few loop iterations (or a few cache lines).
When prefetching to cache, one risks the possibility that the prefetched data may replace other needed data or may be evicted before it is used. Contention in the on-chip cache hierarchy can be avoided by buffering prefetched data lower in the memory system. For instance, the Stream Memory Controller