Skeleton-based Automatic Parallelization of Image Processing Algorithms for GPUs

Cedric Nugteren, Henk Corporaal, Bart Mesman
Eindhoven University of Technology, The Netherlands
{c.nugteren, h.corporaal, b.mesman}@tue.nl

Abstract—Graphics Processing Units (GPUs) are becoming increasingly important in high-performance computing. To maintain high-quality solutions, programmers have to parallelize and map their algorithms efficiently. This task is far from trivial, leading to the necessity to automate this process.

In this paper, we present a technique to automatically parallelize and map sequential code on a GPU, without the need for code annotations. This technique is based on skeletonization and is targeted at image processing algorithms. Skeletonization separates the structure of a parallel computation from the algorithm's functionality, enabling efficient implementations without requiring architecture knowledge from the programmer. We define a number of skeleton classes, each enabling GPU-specific parallelization techniques and optimizations, including automatic thread creation, on-chip memory usage, and memory coalescing.

Recently, similar skeletonization techniques have been applied to GPUs. Our work uses domain-specific skeletons and a finer-grained classification of algorithms. Compared to existing GPU code generators in general, skeleton-based parallelization can potentially achieve a higher hardware efficiency by enabling algorithm restructuring through skeletons.

In a set of benchmarks, we show that the presented skeleton-based approach generates highly optimized code, achieving a high data throughput. Additionally, we show that the automatically generated code performs close or equal to manually mapped and optimized code. We conclude that skeleton-based parallelization for GPUs is promising, but we believe that future research must focus on the identification of a finer-grained and complete classification.

I. INTRODUCTION

Multi- and many-core architectures are becoming a hot topic in the fields of computer architecture and high-performance computing. Processor design is no longer driven by high clock frequencies; instead, more and more programmable cores are becoming available, even for low-power and embedded processors. At the same time, Graphics Processing Units (GPUs) are becoming increasingly programmable. While their graphics pipeline was completely fixed a few years ago, we are now able to program all main processing elements using C-like languages.

Multi-core and many-core architectures are emerging and becoming a commodity. Many experts believe that heterogeneous processor platforms including both GPUs and CPUs will be the future trend [15]. Signs of this trend are already present in programming languages such as NVIDIA's CUDA or OpenCL. Languages such as CUDA are specifically designed for general-purpose GPU programming. Even so, the mapping process of most applications remains non-trivial, and only a tiny fraction of programmers is able to achieve optimal hardware usage for their applications. Achieving this requires dealing with the mapping of distributed memories, caches, and register files. Additionally, the programmer is exposed to parallel computing problems such as data dependencies, race conditions, synchronization barriers, and atomic operations. As an example, a highly optimized GPU implementation of a reduction algorithm is about 30x faster than a naive implementation [10].

In the future, many programmers will need access to the GPU's computing power, while nowadays only a fraction of programmers is able to benefit from these resources. Therefore, in order to use heterogeneous platforms efficiently, we have to change GPU programming.
One way to achieve this is to automatically generate high-performance code, preferably without requiring input (such as annotations) from the programmer.

In this paper, we focus on image processing algorithms, tackling the problem of how to generate high-performance GPU code from sequential CPU code. We use skeleton-based parallelization, enabling the generation of GPU code from sequential CPU code while requiring no programmer knowledge of the architecture or the programming language. The contributions of this paper are as follows:

• Image processing algorithms are classified into a number of fine-grained skeleton classes. The classification is based on a study of the OpenCV library.

• For each class, a skeleton implementation is presented. To use the GPU's hardware efficiently, we generate optimized code. This includes, among others, a high off-chip memory throughput and the efficient use of on-chip shared memory.

• We evaluate the limitations of the presented technique and discuss the future of skeleton-based parallelization for GPUs.

This paper is organized as follows. First, related work is discussed in Section II. A motivation for the used techniques is found in Section III. Section IV introduces GPU programming and image processing. Section V presents the code generation techniques. Next, Section VI introduces and discusses the skeleton implementations. Section VII elaborates on the efficient use of GPU hardware, presents the results of skeleton-based parallelization, and discusses future work. Section VIII concludes the paper.