COMPASS: A Programmable Data Prefetcher Using Idle GPU Shaders

Dong Hyuk Woo, Hsien-Hsin S. Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, GA 30332
dhwoo@ece.gatech.edu, leehs@gatech.edu

Abstract

A traditional fixed-function graphics accelerator has evolved into a programmable general-purpose graphics processing unit over the last few years. These powerful computing cores are mainly used for accelerating graphics applications or enabling low-cost scientific computing. To further reduce the cost and form factor, an emerging trend is to integrate the GPU, along with the memory controllers, onto the same die as the processor cores. However, in such a system-on-chip, the GPU, while occupying a substantial part of the silicon, will sit idle and contribute nothing to the overall system performance when running non-graphics workloads or applications that lack data-level parallelism. In this paper, we propose COMPASS, a compute shader-assisted data prefetching scheme, to leverage the GPU resource for improving single-threaded performance on an integrated system. By harnessing the GPU shader cores with very lightweight architectural support, COMPASS can emulate the functionality of a hardware-based prefetcher using the idle GPU and successfully improve the memory performance of single-thread applications. Moreover, thanks to its flexibility and programmability, one can implement the best-performing prefetch scheme for each specific application, as demonstrated in this paper. With COMPASS, we envision that a future application vendor can provide a custom-designed COMPASS shader, bundled with its software and loaded at runtime, to optimize performance. Our simulation results show that COMPASS can improve the single-thread performance of memory-intensive applications by 68% on average.
Categories and Subject Descriptors I.3.1 [Computer Graphics]: Hardware Architecture—Graphics processors; B.3.2 [Memory Structures]: Design Styles—Cache memories

General Terms Design, Experimentation, Performance

Keywords GPU, Compute Shader, Prefetch

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ASPLOS’10, March 13–17, 2010, Pittsburgh, Pennsylvania, USA.
Copyright © 2010 ACM 978-1-60558-839-1/10/03...$10.00

1. Introduction

To meet the modern needs of game developers, a traditional fixed-function graphics accelerator has evolved into a programmable graphics processing unit (GPU), which allows game developers to write their own shaders for specific special effects. Because of its vast computational capability, a modern GPU is also designed to run non-graphics, compute-intensive applications, a usage referred to as general-purpose GPU (GPGPU) [24]. Recently, Intel and AMD announced integrated solutions that place the GPU, the memory controller, and the CPU on a single die for netbook, laptop, and desktop products [28, 36]. Although the integrated chip is not likely to be as powerful as a standalone CPU or GPU, for reasons such as its limited power budget, it lowers the overall system cost and reduces the form factor while offering reasonable performance for its targeted applications and market. Furthermore, the performance loss can be offset to some extent by the substantially reduced latency between the host CPU and the integrated GPU.
Unfortunately, while the host CPU executes the sequential part of a parallelized application or an unparallelized legacy application, the integrated GPU will sit idle, contributing nothing to the single-thread performance. Unlike symmetric multi-core processors, in which many sequential processes can run concurrently on multiple cores, an idle GPU cannot run a conventional CPU process, mainly because of the heterogeneity between the ISAs. Moreover, an idle GPU cannot take advantage of other techniques, such as speculative multi-threading or helper threads [2, 6, 9, 14, 22, 25, 29, 37], to boost single-thread performance unless the GPU is completely redesigned to support them, which could unnecessarily complicate the entire design and degrade performance when running conventional graphics applications.

One way to improve the performance of a CPU while an on-chip GPU is idle is to exploit the remaining power budget. Because an idle GPU consumes only a small amount of power compared to an active GPU, the unused power can be given to the CPU by increasing its supply voltage and clock frequency, similar to the Turbo mode employed in Intel’s Core i7 (Nehalem) processor [18]. Nonetheless, this method will not improve the performance of memory-intensive, single-thread applications, which are typically unscalable and insensitive to clock frequency.

Instead of letting the GPU sit idle, we envision that the OS can utilize the idle GPU to run compute shaders that enhance the memory performance of single-thread applications. In this paper, we propose COMPASS, a compute shader-assisted prefetching scheme, to achieve this goal. With very lightweight architectural support, we demonstrate that COMPASS can enhance the single-thread performance of an integrated CPU by emulating the function of a hardware prefetcher using the programmable shader.
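To make the idea of emulating a hardware prefetcher in software more concrete, the following is a minimal sketch of one well-known prefetching policy, a PC-indexed stride prefetcher, written in Python for readability. It is not the COMPASS shader code, and it runs on a CPU rather than a GPU; the class name `StridePrefetcher`, the method `observe`, and the `degree` parameter are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class StrideEntry:
    last_addr: int           # last miss address seen for this load PC
    stride: int = 0          # most recent address delta
    confident: bool = False  # True once the same stride repeats


class StridePrefetcher:
    """Toy PC-indexed stride prefetcher (illustrative assumption only)."""

    def __init__(self, degree=2):
        self.table = {}       # load PC -> StrideEntry
        self.degree = degree  # number of strides to prefetch ahead

    def observe(self, pc, addr):
        """Record one miss and return the addresses to prefetch."""
        entry = self.table.get(pc)
        if entry is None:
            # First miss from this PC: nothing to predict yet.
            self.table[pc] = StrideEntry(last_addr=addr)
            return []
        stride = addr - entry.last_addr
        # Trust the stride only if it is nonzero and matches the previous one.
        entry.confident = stride != 0 and stride == entry.stride
        entry.stride = stride
        entry.last_addr = addr
        if not entry.confident:
            return []
        return [addr + stride * i for i in range(1, self.degree + 1)]


# Usage: three misses from the same load PC, 64 bytes apart.
p = StridePrefetcher(degree=2)
p.observe(0x400, 1000)
p.observe(0x400, 1064)
print(p.observe(0x400, 1128))  # prints [1192, 1256]
```

Because the policy is ordinary code rather than fixed logic, it could in principle be swapped for a different one per application, which is the flexibility argued for above.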
The rest of this paper is organized as follows: Section 2 describes the details of the GPU architecture used as the baseline of this paper. Section 3 explains the general design of COMPASS, and Section 4 details the design and trade-off of various COMPASS