Experiences with Mapping Non-linear Memory Access Patterns into GPUs

Eladio Gutierrez, Sergio Romero, Maria A. Trenas, and Oscar Plata
Department of Computer Architecture, University of Malaga, Spain
{eladio,sromero,maria,oplata}@uma.es

Abstract. Modern Graphics Processing Units (GPU) are very powerful computational systems on a chip. For this reason there is a growing interest in using these units as general-purpose hardware accelerators (GPGPU). To facilitate the programming of general-purpose applications, NVIDIA introduced the CUDA programming environment. CUDA provides a simplified abstraction of the underlying complex GPU architecture, so a number of critical optimizations must be applied to the code in order to obtain maximum performance. In this paper we discuss our experience in porting an application kernel to the GPU, and the classes of design decisions we adopted in order to obtain maximum performance.

1 Introduction

Driven by the huge computing demands of graphics applications, Graphics Processing Units (GPU) have become highly parallel, multithreaded, many-core processors. Modern GPUs deliver a very large amount of raw performance that has drawn the attention of the scientific community, with a growing interest in using these units to boost the performance of compute-intensive applications; that is, to use GPUs as general-purpose hardware accelerators (General-Purpose Computation on GPUs, or GPGPU [2]). Developing GPGPU codes using the conventional graphics programming APIs is a very hard task with many limitations. This situation motivated the development of general parallel programming environments for GPUs [11,12].
NVIDIA CUDA (Compute Unified Device Architecture) [11], one of the most widespread models, is built around a massively parallel SIMT (Single-Instruction, Multiple-Thread) execution model, supported by the NVIDIA GPU architecture [7], and provides a shared-memory, multi-threaded architectural model for general-purpose GPU programming [10]. CUDA provides a convenient and successful model for programming scalable multi-threaded many-core GPUs across various problem domains [5]. However, the simplified abstraction that the CUDA model provides does not allow maximum performance to be extracted from the underlying GPU physical architecture without applying a set of optimizations to the parallel code [8,13]. We can distinguish two classes of optimizations. The first class corresponds to techniques that fall within

G. Allen et al. (Eds.): ICCS 2009, Part I, LNCS 5544, pp. 924–933, 2009.
© Springer-Verlag Berlin Heidelberg 2009
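As a minimal illustration of the SIMT execution model just described (this sketch is ours, not a kernel from the application studied in this paper), a CUDA kernel is executed by many threads in parallel, each selecting its work item from its block and thread indices. When consecutive threads access consecutive addresses, the hardware can coalesce their global-memory accesses, which is precisely the property that non-linear access patterns break:

```cuda
// Hypothetical example kernel: each thread computes one element of
// y = a*x + y. Consecutive threads read/write consecutive addresses,
// a linear pattern the memory system can coalesce.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    // Global index of this thread within the whole grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard: grid may be larger than n
        y[i] = a * x[i] + y[i];
}

// Host-side launch: enough 256-thread blocks to cover n elements.
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```

A kernel whose index expression is non-linear in the thread index (e.g. an indirection table `y[idx[i]]`) loses this coalescing, which is the kind of access pattern the rest of the paper addresses.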