PATUS: A Code Generation and Autotuning Framework For Parallel Iterative
Stencil Computations on Modern Microarchitectures
Matthias Christen, Olaf Schenk, Helmar Burkhart
Department of Mathematics and Computer Science
University of Basel, Switzerland
{m.christen | olaf.schenk | helmar.burkhart}@unibas.ch
Abstract—Stencil calculations comprise an important class of kernels in many scientific computing applications, ranging from simple PDE solvers to constituent kernels in multigrid methods and image processing applications. In such solvers, stencil kernels are often the dominant part of the computation, and an efficient parallel implementation of the kernel is therefore crucial to reducing the time to solution. However, on current complex hardware microarchitectures, meticulous architecture-specific tuning is required to elicit the machine's full compute power. We present PATUS, a code generation and autotuning framework for stencil computations targeted at multi- and manycore processors such as multicore CPUs and graphics processing units. PATUS generates compute kernels from a specification of the stencil operation together with a parallelization and optimization strategy, and leverages the autotuning methodology to optimize strategy-dependent parameters for the given hardware architecture.
Keywords-stencil computations; code generation; autotuning;
high performance computing
I. INTRODUCTION
Stencil calculations comprise an important class of kernels in many scientific computing applications, ranging from simple PDE solvers to constituent kernels in multigrid methods and image processing applications. In such solvers, the major part of the computation time is often spent in a stencil kernel. To minimize the time to solution, it is therefore important that the stencil kernels use the available computing resources as efficiently as possible. However, on current complex hardware microarchitectures, meticulous architecture-specific tuning is required to elicit the machine's full power. Such tuning not only requires a deep understanding of the architecture, but is also a time-consuming and error-prone process.
Libraries and code generators for other important kernels in scientific computing, including dense and sparse linear algebra and discrete transforms, have successfully adopted autotuning as a means to automatically select, from a family of codes, the version that delivers the best performance according to automatic performance benchmarks.
The PATUS framework, short for "Parallel AutoTUned Stencils", is a code generation and autotuning tool for the class of stencil computations. It is the result of generalizing the insights gained from performance studies of a kernel from a real-world application on different kinds of architectures.
The idea behind the PATUS framework is twofold: on the one hand, it provides a software infrastructure for generating architecture-specific stencil code from a specification of the stencil, incorporating domain-specific knowledge that permits optimizing the code beyond the abilities of current compilers; on the other hand, it aims at being an experimentation toolbox for parallelization and optimization strategies. Using small domain-specific languages, the user can define the stencil kernel in a C-like syntax and can choose among predefined strategies for how the kernel is optimized and parallelized, or design a custom strategy in order to experiment with other algorithms or find a better mapping to the hardware in use. This is one of the key features in which PATUS differs from other code generation and autotuning frameworks for stencil codes, such as the one proposed by Kamil [1].
Besides supporting almost arbitrary types of stencils on structured grids and generating code from strategy templates, another goal of PATUS is to support future hardware microarchitectures and programming paradigms. The modular code generator back-end allows adding support for new hardware by defining hardware-specific characteristics and implementing code generator methods for a few communication and synchronization primitives.
Currently, PATUS supports traditional CPU architectures, using OpenMP for parallelization, as well as NVIDIA CUDA-capable GPUs.
II. RELATED WORK
Autotuning has been applied successfully in diverse libraries and frameworks for various types of kernels that occur frequently in scientific computing, including ATLAS [2] and FLAME [3] for dense linear algebra, OSKI [4] for sparse linear algebra, FFTW [5] and SPIRAL [6] for signal processing transforms, and recently in a framework for stencil computations [1].
The search space is either a (possibly parameterized) code base from which the autotuner, in an offline process, determines the version and parameters that deliver the best performance on a given architecture (ATLAS, FFTW using
2011 IEEE International Parallel & Distributed Processing Symposium
1530-2075/11 $26.00 © 2011 IEEE
DOI 10.1109/IPDPS.2011.70
676