Empirical performance model-driven data layout optimization and library call selection for tensor contraction expressions

Qingda Lu a,1, Xiaoyang Gao a,2, Sriram Krishnamoorthy a,3, Gerald Baumgartner b,*, J. Ramanujam c, P. Sadayappan a

a Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA
b Department of Computer Science, Louisiana State University, Baton Rouge, LA 70803, USA
c Department of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA 70803, USA

Abstract

Empirical optimizers like ATLAS have been very effective in optimizing computational kernels in libraries. The best choice of parameters such as tile size and degree of loop unrolling is determined in ATLAS by executing different versions of the computation. In contrast, optimizing compilers use a model-driven approach to program transformation. While the model-driven approach of optimizing compilers is generally orders of magnitude faster than ATLAS-like library generators, its effectiveness can be limited by the accuracy of the performance models used. In this paper, we describe an approach where a class of computations is modeled in terms of constituent operations that are empirically measured, thereby allowing modeling of the overall execution time. The performance model with empirically determined cost components is used to select library calls and choose data layout transformations in the context of the Tensor Contraction Engine, a compiler for a high-level domain-specific language for expressing computational models in quantum chemistry. The effectiveness of the approach is demonstrated through experimental measurements on representative computations from quantum chemistry.

Keywords: data layout optimization, library call selection, compiler optimization, tensor contractions

1. Introduction

Optimizing compilers use high-level program transformations to generate efficient code.
The computation is modeled in some form and its cost is derived in terms of metrics such as reuse distance. Program transformations are then applied in order to reduce the cost. The large number of parameters and the variety of programs to be handled restrict optimizing compilers to model-driven optimization with relatively simple cost models. As a result, there has been much recent interest in developing generalized tuning systems that can similarly tune and optimize codes input by users or library developers [12, 60, 57]. Approaches that empirically optimize a computation, such as ATLAS [59] (for linear algebra) and FFTW [19], generate solutions for different structures of the optimized code and determine the parameters that optimize the execution time by running different versions of the code on a given target architecture and choosing the optimal one. But empirical optimization of large complex applications can be prohibitively expensive. In this paper, we decompose a class of computations into its constituent operations and model the execution time of the computation in terms of an empirical characterization of its constituent operations. The empirical measurements allow modeling of the overall execution time of the computation, while decomposition enables off-line determination of the cost model and efficient global optimization across multiple constituent operations. This approach combines

* This work was supported in part by the National Science Foundation under grants CHE-0121676, CHE-0121706, CNS-0509467, CCF-0541409, CCF-1059417, CCF-0073800, EIA-9986052, and EPS-1003897.
* Corresponding author. Tel.: +1 225 578 2191; fax: +1 225 578 1465.
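The idea of characterizing constituent operations off-line and then composing their measured costs into a model of the whole computation can be illustrated with a small sketch. The operation names, sizes, and the two candidate "layouts" below are hypothetical illustrations, not the paper's actual primitives or cost model; the real system works with index-permutation and GEMM library calls on large tensors.

```python
import time

# Two hypothetical constituent operations of a tensor contraction:
# an index permutation (layout transform) and a matrix multiplication.
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    bt = transpose(b)  # transpose b once so inner loops scan rows
    return [[sum(x * y for x, y in zip(ra, cb)) for cb in bt] for ra in a]

def measure(op, *args, reps=3):
    """Empirically measure the cost of one constituent operation
    (best of a few repetitions, to reduce timing noise)."""
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        op(*args)
        best = min(best, time.perf_counter() - t0)
    return best

n = 64
a = [[1.0] * n for _ in range(n)]
b = [[2.0] * n for _ in range(n)]

# Off-line empirical characterization of each constituent operation.
costs = {
    "transpose": measure(transpose, a),
    "matmul": measure(matmul, a, b),
}

# Model each candidate implementation of the overall computation as the
# sum of its constituent operation costs, then pick the cheapest.
candidates = {
    "layout_A": costs["matmul"],                       # data already in place
    "layout_B": costs["transpose"] + costs["matmul"],  # transform, then multiply
}
best = min(candidates, key=candidates.get)
```

Because the per-operation costs are measured once and reused, adding more candidate layouts only changes the cheap arithmetic in the final step, which is what makes global optimization across many constituent operations affordable compared to timing every full-program variant.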
1 Current address: Software and Services Group, Intel Corporation, 2111 NE 25th Ave., Hillsboro, OR 97124, USA
2 Current address: IBM Silicon Valley Lab., 555 Bailey Ave., San Jose, CA 95141, USA
3 Current address: Computational Sciences and Mathematics Division, Pacific Northwest National Laboratory, Richland, WA 99352, USA

Preprint submitted to Elsevier, September 27, 2011