Glinda*: A Framework for Accelerating Imbalanced Applications on Heterogeneous Platforms

Jie Shen, Ana Lucia Varbanescu, Henk Sips
Parallel and Distributed Systems Group
Delft University of Technology, The Netherlands
j.shen@tudelft.nl

Michael Arntzen†, Dick G. Simons
Air Transport & Operations, Aerospace Engineering
Delft University of Technology, The Netherlands
michael.arntzen@nlr.nl

* Glinda is a good witch in the Wizard of Oz. We picked it as our framework name for its magic.
† Also affiliated with the Dutch National Aerospace Laboratory.

ABSTRACT

Heterogeneous platforms integrating different processors, such as GPUs and multi-core CPUs, have become popular in high performance computing. While most applications currently use only the homogeneous parts of these platforms, we argue that there is a large class of applications that can benefit from their heterogeneity: massively parallel imbalanced applications. Such applications emerge, for example, from numerical methods and simulations based on variable time steps. In this paper, we present Glinda, a framework for accelerating imbalanced applications on heterogeneous computing platforms. Our framework is able to correctly detect the application's workload characteristics, make choices based on the available parallel solutions and hardware configuration, and automatically obtain the optimal workload decomposition and distribution. Our experiments on parallelizing a heavily imbalanced acoustic ray tracing application show that Glinda improves application performance in multiple scenarios, achieving up to 12× speedup against manually configured parallel solutions.

Categories and Subject Descriptors

C.4 [Performance of Systems]; D.2.6 [Programming Environments]: Integrated environments; C.1.3 [Processor Architectures]: Heterogeneous (hybrid) systems

General Terms

Design, Performance

Keywords

Heterogeneous systems, GPUs, Multi-core processors, OpenCL, Acoustic ray tracing

1. INTRODUCTION

GPUs (Graphics Processing Units) and GPGPU programming (General-Purpose GPU programming) keep gaining popularity in parallel computing [17]. Many applications with massive parallelism (e.g., image processing, games, big graph processing, scientific computation) have been significantly accelerated on GPUs [19, 14, 11, 25]. The leading GPU vendors, NVIDIA and AMD, have continuously released new GPU products and updated development tools, aiming to improve GPU computing and make these applications run even faster.

With the rise of GPGPU programming, heterogeneous computing that integrates GPUs and CPUs has also become attractive. In this case, the application is divided to run on different processors in a cooperative way, taking advantage of both the GPU and the multi-core CPU. In addition, single-chip hybrid CPU+GPU architectures, such as Intel Sandy Bridge [12] and AMD Fusion [1], have been launched, becoming a strong incentive for heterogeneous computing.

To program heterogeneous systems in a unified way, the OpenCL (Open Computing Language) programming model is a good solution [7]. OpenCL is designed as a virtual computing platform, consisting of a host connected to one or more compute devices (e.g., GPUs, multi-core CPUs). The user programs this "virtual" platform, and the resulting code is able to run on multiple (types of) devices. In OpenCL, a compute device is divided into multiple compute units (CUs); CUs are further divided into multiple processing elements (PEs); the PEs perform the computation (compute kernels). An instance of a compute kernel is called a work-item, and one work-item is executed for each point in the problem space. Further, work-items are organized into work-groups for OpenCL execution management.
From the application perspective, applications that achieve high parallel performance on GPUs are usually massively parallel and balanced, i.e., each data element carries a relatively similar computation workload (Figure 1 (a)). However, applications can also have imbalanced workloads (Figure 1 (b)), and such cases are not rare. For example, many simulation-based scientific applications use a variable time step as a technique to ensure sufficient simulation accuracy, effectively generating different workloads for different simulation points. Note that this imbalance is typically generated by some sort of data-dependent behavior, making imbalanced applications a subset of irregular applications.

In balanced applications, the whole computation can be evenly distributed among all processing cores. The cores start and finish the computation at a similar pace, ensuring high core occupancy and utilization. Thus, a homogeneous (massively) parallel platform (a GPU or a multi-core CPU) is suitable for such cases. When an application has imbalanced workloads,