Energy-efficient Computing on Distributed GPUs using Dynamic Parallelism and GPU-controlled Communication

Lena Oden
Fraunhofer Institute for Industrial Mathematics, Competence Center High Performance Computing, Kaiserslautern, Germany
oden@itwm.fhg.de

Benjamin Klenk and Holger Fröning
Ruprecht-Karls University of Heidelberg, Institute of Computer Engineering, Heidelberg, Germany
{klenk,froening}@uni-hd.de

Abstract—GPUs are widely used in high performance computing due to their high computational power and high performance per Watt. Still, one of the main bottlenecks of GPU-accelerated cluster computing is the data transfer between distributed GPUs. This not only affects performance, but also power consumption. The most common way to utilize a GPU cluster is a hybrid model, in which the GPU is used to accelerate the computation while the CPU is responsible for the communication. This approach always requires a dedicated CPU thread, which consumes additional CPU cycles and therefore increases the power consumption of the complete application. In recent work we have shown that the GPU is able to control the communication independently of the CPU. Still, there are several problems with GPU-controlled communication. The main problem is intra-GPU synchronization, since GPU thread blocks are non-preemptive; the use of blocking communication requests within a GPU kernel can therefore easily result in a deadlock. In this work we show how Dynamic Parallelism solves this problem. GPU-controlled communication in combination with Dynamic Parallelism allows keeping the control flow of multi-GPU applications on the GPU and bypassing the CPU completely. Although the performance of applications using GPU-controlled communication is still slightly worse than that of hybrid applications, we show that performance per Watt increases by up to 10% while still using commodity hardware.

I. INTRODUCTION

During the last years, graphics processing units have gained high popularity in high performance computing. Programming languages like CUDA and OpenCL, as well as directive-based approaches like OpenACC, make the features of GPUs available to developers who are not familiar with the classic graphics aspects. Therefore, GPUs are deployed in an increasing number of HPC systems, especially since energy efficiency is becoming more and more important for technical, economic and ecological reasons. In particular, the first 15 systems of the Green500 list from June 2014 are all accelerated with NVIDIA Kepler K20 GPUs [1]. For example, an Intel Xeon E5-2687W processor (8 cores, 3.4 GHz, AVX) achieves about 216 GFLOPS at a thermal design power (TDP) of about 150 W, resulting in 1.44 GFLOPS/W. An NVIDIA K20 GPU is specified with a TDP of 250 W and a single precision peak performance of 3.52 TFLOPS, resulting in 14.08 GFLOPS/W.

New features like CUDA Dynamic Parallelism help the GPU become more independent of the CPU by allowing the GPU to start and stop compute kernels without context switches to the host. This relieves the CPU of this work, helping to save power for GPU-centric applications.

GPUs are powerful, scalable many-core processors, but they excel in performance only if they can operate on data that is held in-core. Still, GPU memory is a scarce resource, and this is one of the reasons why GPUs are deployed in clusters. However, communication and data transfer are among the main bottlenecks of GPU-accelerated computing. Since we are now facing an era in which communication and data transfers dominate power consumption [2], it is not only necessary to optimize the computation of GPU-accelerated applications with regard to energy and time; it is even more important to optimize communication aspects.
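As a minimal illustration of this kernel-launch capability, the following CUDA sketch (hypothetical kernel names; requires a device of compute capability 3.5 or higher and compilation with `-rdc=true -lcudadevrt`) shows a parent kernel launching a child kernel directly from the device, so the control flow never returns to the host between launches. Note that device-side `cudaDeviceSynchronize()` was the standard way to wait for a child grid in CUDA versions of this paper's era, though it has been deprecated in recent CUDA releases.

```cuda
#include <cuda_runtime.h>

// Child kernel: performs the actual work on a data array.
__global__ void child_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Parent kernel: with Dynamic Parallelism, a device thread can launch
// further kernels itself -- no context switch back to the CPU is needed.
__global__ void parent_kernel(float *data, int n) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        child_kernel<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();  // device-side wait for the child grid
    }
}

int main() {
    const int n = 1024;
    float *data;
    cudaMalloc(&data, n * sizeof(float));
    // The host launches only once; all further control flow stays on the GPU.
    parent_kernel<<<1, 1>>>(data, n);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```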
Applications that are running on distributed GPUs normally use a hybrid programming model, in which computational tasks are accelerated by GPUs, while data transfer between the GPUs is controlled by the CPUs. This approach requires frequent context switches between CPU and GPU, and for the whole execution time a dedicated CPU thread is required to orchestrate GPU computations and GPU-related communication. This CPU thread requires additional power and therefore increases the energy consumption of the complete application, preventing the CPU from entering sleep states. In recent work [3] we introduced a framework that allows GPUs to source and sink communication requests to InfiniBand hardware and thereby to bypass the CPU completely. So far, this approach does not bring any performance benefits, but rather losses. This is caused by the work request generation on GPUs, which shows a much higher overhead compared to CPUs. Still, this technique allows keeping the control flow of a multi-GPU application on the GPU, avoiding context switches between CPU and GPU and relieving the CPU from communication work. Thus, GPU-controlled communication can help to reduce
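To make the cost of the hybrid model concrete, the following sketch (hypothetical kernel and helper names, assuming a halo-exchange style application with CPU-side MPI communication) shows the dedicated CPU thread that orchestrates every iteration; it is exactly this busy host loop that burns CPU cycles and prevents sleep states:

```cuda
#include <cuda_runtime.h>

__global__ void stencil_step(float *field, int n);      // hypothetical kernel

// Hypothetical helper: CPU-driven boundary exchange, e.g. via MPI.
void exchange_halo_via_mpi(float *host_buf, int n);

// Hybrid model: the CPU thread alternates between launching GPU compute
// and performing the communication itself. Each iteration forces context
// switches between GPU and CPU, keeping one CPU core busy throughout.
void hybrid_time_loop(float *field, float *host_buf, int n, int steps) {
    for (int t = 0; t < steps; ++t) {
        stencil_step<<<(n + 255) / 256, 256>>>(field, n);
        cudaDeviceSynchronize();                   // CPU waits for the GPU
        cudaMemcpy(host_buf, field, n * sizeof(float),
                   cudaMemcpyDeviceToHost);        // stage data on the host
        exchange_halo_via_mpi(host_buf, n);        // CPU-controlled transfer
        cudaMemcpy(field, host_buf, n * sizeof(float),
                   cudaMemcpyHostToDevice);
    }
}
```

With GPU-controlled communication combined with Dynamic Parallelism, this loop can instead live inside a parent kernel that alternates child compute kernels with device-issued communication requests, so the dedicated CPU thread above is no longer needed.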