Autotuning Stencil-Based Computations on GPUs

Azamat Mametjanov, Daniel Lowell, Ching-Chen Ma, Boyana Norris

Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439
{azamat,dlowell,norris}@mcs.anl.gov

Computer Science & Software Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803
mac@rose-hulman.edu

Abstract—Finite-difference, stencil-based discretization approaches are widely used in the solution of partial differential equations describing physical phenomena. Newton-Krylov iterative methods commonly used in stencil-based solutions generate matrices that exhibit diagonal sparsity patterns. To exploit these structures on modern GPUs, we extend the standard diagonal sparse matrix representation and define new matrix and vector data types in the PETSc parallel numerical toolkit. We create tunable CUDA implementations of the operations associated with these types after identifying a number of GPU-specific optimizations and tuning parameters for these operations. We discuss our implementation of GPU autotuning capabilities in the Orio framework and present performance results for several kernels, comparing them with vendor-tuned library implementations.

Index Terms—autotuning, stencil, CUDA, GPU

I. INTRODUCTION

Exploiting hybrid CPU/GPU architectures effectively typically requires reimplementation of existing CPU codes. Furthermore, the rapid evolution in accelerator capabilities means that GPU implementations must be revised frequently to attain good performance. One approach to avoiding such code reimplementation and manual tuning is to automate CUDA code generation and tuning.
In this paper, we introduce a preliminary implementation of a CUDA backend in our Orio autotuning framework, which accepts a high-level specification of the computation as input and then generates multiple code versions that are empirically evaluated to select the best-performing version for given problem inputs and target hardware. In our prototype, we target key kernels in the PETSc parallel numerical toolkit, which is widely used to solve problems modeled by nonlinear partial differential equations (PDEs).

A. Motivation

Increasing heterogeneity in computer architectures at all scales presents significant new challenges to effective software development in scientific computing. Key numerical kernels in high-performance scientific libraries such as Hypre [11], PETSc [3], [4], [5], SuperLU [13], and Trilinos [22] are responsible for much of the execution time of scientific applications. Typically, these kernels implement the steps of an iterative linear solution method, which is used to solve the linearized problem by using a family of Newton-Krylov methods. In order to achieve good performance, these kernels must be optimized for each particular architecture. Automating the generation of highly optimized versions of key kernels will improve both application performance and the library developers' productivity. Furthermore, libraries can be "customized" for specific applications by generating versions that are optimized for specific input and use characteristics.

Traditionally, numerical libraries are built infrequently on a given machine, and then applications are linked against these prebuilt versions to create an executable. While this model has worked well for decades, allowing the encapsulation of sophisticated numerical approaches in application-independent, reusable software units, it suffers from several drawbacks to achieving high performance on modern architectures.
First, it provides a partial view of the implementation (either when compiling the library or the application code using it), limiting potential compiler optimizations. Because the library is built without any information on how exactly it will be used, many potentially beneficial optimizations are not considered. Second, large toolkits such as PETSc and Trilinos provide many configuration options whose values can significantly affect application performance. Blindly using a prebuilt library can result in much lower performance than achievable on a particular hardware platform. Even when a multitude of highly optimized methods exist, it is not always clear which implementation is most appropriate in a given application context. For example, the performance of different sparse linear solvers varies for linear systems with different characteristics.

Our goal is to tackle the challenges in achieving the best possible performance in the low-level fundamental kernels that many higher-level numerical algorithms share through application-aware code generation and tuning. Several components are required in order to provide these code generation capabilities:

- A mechanism for defining the computation at a sufficiently high level
- Application-specific library optimizations
- Automatic code generation and tuning of computationally significant library and application functionality
- Ability to use nongenerated (manual) implementations when desired