PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures

Matthias Christen, Olaf Schenk, Helmar Burkhart
Department of Mathematics and Computer Science
University of Basel, Switzerland
{ m.christen | olaf.schenk | helmar.burkhart } @unibas.ch

Abstract—Stencil calculations comprise an important class of kernels in many scientific computing applications, ranging from simple PDE solvers to constituent kernels in multigrid methods as well as image processing applications. In such solvers, stencil kernels are often the dominant part of the computation, and an efficient parallel implementation of the kernel is therefore crucial to reducing the time to solution. However, on current complex hardware microarchitectures, meticulous architecture-specific tuning is required to elicit the machine's full compute power. We present PATUS, a code generation and autotuning framework for stencil computations targeted at multi- and manycore processors, such as multicore CPUs and graphics processing units. It makes it possible to generate compute kernels from a specification of the stencil operation and a parallelization and optimization strategy, and it leverages the autotuning methodology to optimize strategy-dependent parameters for the given hardware architecture.

Keywords—stencil computations; code generation; autotuning; high performance computing

I. INTRODUCTION

Stencil calculations comprise an important class of kernels in many scientific computing applications, ranging from simple PDE solvers to constituent kernels in multigrid methods as well as image processing applications. In such solvers, the major part of the computation time is often spent in a stencil kernel. To minimize the time to solution, it is therefore important that the stencil kernels use the available computing resources as efficiently as possible.
However, on current complex hardware microarchitectures, meticulous architecture-specific tuning is required to elicit the machine's full power. This not only requires a deep understanding of the architecture, but is also a time-consuming and error-prone process.

Libraries and code generators for other important kernels in scientific computing, including dense and sparse linear algebra and discrete transforms, have successfully adopted autotuning as a means to automatically select, based on automated performance benchmarks, the code that delivers the best performance from a family of codes.

The PATUS framework is a code generation and autotuning tool for the class of stencil computations. PATUS stands for "Parallel AutoTUned Stencils". It is the result of generalizing the insights gained from performance studies of a kernel from a real-world application on different kinds of architectures.

The idea behind the PATUS framework is twofold. On the one hand, it provides a software infrastructure for generating architecture-specific stencil code from a specification of the stencil; this specification incorporates domain-specific knowledge that permits optimizing the code beyond the abilities of current compilers. On the other hand, it aims to be an experimentation toolbox for parallelization and optimization strategies. Using small domain-specific languages, the user can define the stencil kernel in a C-like syntax and can either choose from predefined strategies that determine how the kernel is optimized and parallelized, or design a custom strategy in order to experiment with other algorithms or find a better mapping to the hardware in use. This is one of the key features that distinguishes PATUS from other code generation and autotuning frameworks for stencil codes, such as the one proposed by Kamil [1].
Besides supporting almost arbitrary types of stencils on structured grids and generating code from strategy templates, another goal of PATUS is to support future hardware microarchitectures and programming paradigms. The modular code generator back-end allows adding support for new hardware by defining hardware-specific characteristics and implementing code generator methods for a few communication and synchronization primitives. Currently, we support traditional CPU architectures using OpenMP for parallelization, as well as NVIDIA CUDA-capable GPUs.

II. RELATED WORK

Autotuning has been applied successfully in diverse libraries and frameworks for various types of kernels which occur frequently in scientific computing, including ATLAS [2] and FLAME [3] for dense linear algebra, OSKI [4] for sparse linear algebra, FFTW [5] and SPIRAL [6] for signal processing transforms, and recently in a framework for stencil computations [1].

The search space is either a (possibly parameterized) code base from which the autotuner, in an offline process, determines the version and parameters that display the best performance on a given architecture (ATLAS, FFTW using

2011 IEEE International Parallel & Distributed Processing Symposium, 1530-2075/11 $26.00 © 2011 IEEE, DOI 10.1109/IPDPS.2011.70