Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design

Peter Thoman, Klaus Kofler, and John Thomson
University of Innsbruck

Abstract. The OpenCL standard allows targeting a large variety of CPU, GPU and accelerator architectures using a single unified programming interface and language. While the standard guarantees portability of functionality for complying applications and platforms, performance portability on such a diverse set of hardware is limited. Devices may vary significantly in memory architecture as well as type, number and complexity of computational units. To characterize and compare the OpenCL performance of existing and future devices we propose a suite of microbenchmarks, uCLbench. We present measurements for eight hardware architectures – four GPUs, three CPUs and one accelerator – and illustrate how the results accurately reflect unique characteristics of the respective platform. In addition to measuring quantities traditionally benchmarked on CPUs like arithmetic throughput or the bandwidth and latency of various address spaces, the suite also includes code designed to determine parameters unique to OpenCL like the dynamic branching penalties prevalent on GPUs. We demonstrate how our results can be used to guide algorithm design and optimization for any given platform on an example kernel that represents the key computation of a linear multigrid solver. Guided manual optimization of this kernel results in an average improvement of 61% across the eight platforms tested.

1 Introduction

The search for higher sustained performance and efficiency has, over recent years, led to increasing use of highly parallel architectures. This movement includes GPU computing, accelerator architectures like the Cell Broadband Engine, but also the increased thread- and core-level parallelism in classical CPUs [9].
In order to provide a unified programming environment capable of effectively targeting this variety of devices, the Khronos Group proposed the OpenCL standard. It includes a runtime API to facilitate communication with devices and a C99-based language specification for writing device code. Currently, many hardware vendors provide implementations of the standard, including AMD, NVIDIA and IBM.

The platform model for OpenCL comprises a host – the main computer – and several devices, each with its own global memory. Computation is performed by invoking data-parallel kernels on an N-dimensional grid of work items. Each point in the grid is mapped to a processing element, and processing elements are grouped into compute units that share local memory.

Broad acceptance of the standard leads to the interesting situation where vastly different hardware architectures can be targeted with essentially unchanged code. However, implementations well suited to one platform may – because of seemingly small architectural differences – fail