Optimizing MapReduce for GPUs with Effective Shared Memory Usage

Linchuan Chen    Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University
Columbus, OH 43210
{chenlinc,agrawal}@cse.ohio-state.edu

ABSTRACT
Accelerators and heterogeneous architectures in general, and GPUs in particular, have recently emerged as major players in high performance computing. For many classes of applications, MapReduce has emerged as the framework for easing parallel programming and improving programmer productivity. There have already been several efforts on implementing MapReduce on GPUs.
In this paper, we propose a new implementation of MapReduce for GPUs, which is very effective in utilizing shared memory, a small programmable cache on modern GPUs. The main idea is to use a reduction-based method to execute a MapReduce application. The reduction-based method allows us to carry out reductions in shared memory. To support a general and efficient implementation, we provide the following features: a memory hierarchy for maintaining the reduction object, a multi-group scheme in shared memory to trade off space requirements and locking overheads, a general and efficient data structure for the reduction object, and an efficient swapping mechanism.
We have evaluated our framework with seven commonly used MapReduce applications and compared it with sequential implementations, with MapCG, a recent MapReduce implementation on GPUs, and with Ji et al.'s work, a recent MapReduce implementation that utilizes shared memory in a different way. The main observations from our experimental results are as follows. For the four of the seven applications that can be considered reduction-intensive, our framework achieves a speedup of between 5 and 200 over MapCG (for large datasets). Similarly, it achieves a speedup of between 2 and 60 over Ji et al.'s work.
Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Concurrent Programming - Parallel programming

General Terms
Performance

Keywords
GPU, MapReduce, shared memory

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
HPDC'12, June 18–22, 2012, Delft, The Netherlands.
Copyright 2012 ACM 978-1-4503-0805-2/12/06 ...$10.00.

1. INTRODUCTION
The work presented in this paper is driven by two recent but independent trends. First, within the last 3-4 years, GPUs have emerged as a means for achieving extreme-scale, cost-effective, and power-efficient high performance computing. Some of the fastest machines in the world today are based on NVIDIA GPUs. At the same time, the very favorable price-to-performance ratio offered by GPUs is bringing supercomputing to the masses. It is common for desktops and laptops today to have a GPU, which can be used for accelerating a compute-intensive application. The peak single-precision performance of an NVIDIA Fermi card today is more than 1 Teraflop, giving a price-to-performance ratio of $2-4 per Gigaflop. Yet another key advantage of GPUs is their very favorable power-to-performance ratio.
Second, the past decade has seen unprecedented data growth, as information is continuously generated in digital format. This has sparked a new class of high-end applications in which there is a need to perform efficient data analysis on massive datasets. Such applications, with their associated data management and efficiency requirements, define the term Data-Intensive SuperComputing (DISC) [3].
The growing prominence of data-intensive applications has coincided with the emergence of the MapReduce paradigm for implementing this class of applications [6]. The MapReduce abstraction has also been found to be suitable for specifying a number of applications that perform a significant amount of computation (e.g., machine learning and data mining algorithms). These applications can be accelerated using GPUs or other similar heterogeneous computing devices. As a result, there have been several efforts on supporting MapReduce on a GPU [4, 12, 13].
A GPU is a complex architecture, and significant effort is often needed to tune the performance of a particular application or framework on this architecture. Effective utilization of shared memory, a small programmable cache on each multiprocessor of the GPU, has been an important factor in performance for almost all applications. In comparison, there has been only a limited amount of work on tuning MapReduce implementations on a GPU to effectively utilize shared memory [15].
This paper describes a new implementation of MapReduce for GPUs, which is very effective in utilizing shared memory. The main idea is to perform reduction-based processing of a MapReduce application. In this approach, a key-value pair that is generated is immediately merged with the current copy of the output results. For this purpose, a reduction object is used. Since the memory require-