Published: March 23, 2011. © 2011 American Chemical Society. dx.doi.org/10.1021/ct100701w | J. Chem. Theory Comput. 2011, 7, 949–954. pubs.acs.org/JCTC

Dynamic Precision for Electron Repulsion Integral Evaluation on Graphical Processing Units (GPUs)

Nathan Luehr, Ivan S. Ufimtsev, and Todd J. Martínez*

PULSE Institute and Department of Chemistry, Stanford University, Stanford, California 94305, United States
SLAC National Accelerator Laboratory, Menlo Park, California 94025, United States

Supporting Information

ABSTRACT: It has recently been demonstrated that novel streaming architectures found in consumer video gaming hardware such as graphical processing units (GPUs) are well-suited to a broad range of computations including electronic structure theory (quantum chemistry). Although recent GPUs have developed robust support for double precision arithmetic, they continue to provide 2–8× more hardware units for single precision. In order to maximize performance on GPU architectures, we present a technique of dynamically selecting double or single precision evaluation for electron repulsion integrals (ERIs) in Hartree–Fock and density functional self-consistent field (SCF) calculations. We show that precision error can be effectively controlled by evaluating only the largest integrals in double precision. By dynamically scaling the precision cutoff over the course of the SCF procedure, we arrive at a scheme that minimizes the number of double precision integral evaluations for any desired accuracy. This dynamic precision scheme is shown to be effective for an array of molecules ranging in size from 20 to nearly 2000 atoms.

INTRODUCTION

It has recently been recognized that consumer video game hardware is well suited to many tasks in computational chemistry, including electronic structure theory, 1–10 ab initio molecular dynamics, 11 and empirical force-field-based molecular dynamics.
8,12–14 The emergence of the CUDA development framework from NVIDIA has made it much easier to repurpose this hardware for scientific computing, 15 compared to early efforts on similar architectures that had to resort to low-level instructions. 16 Nevertheless, efficient use of graphical processing units (GPUs) requires careful attention to some specialized hardware constraints, such as memory access patterns and the nonuniform efficiency of floating point arithmetic at different precisions. Furthermore, GPUs have been carefully designed for maximum performance in specific graphics processing tasks and are otherwise severely limited. It is unlikely that these limitations will be fully eliminated because in large part they provide the foundation of the GPU's computational prowess.

The first CUDA-enabled GPUs had no support for double precision arithmetic, demanding care in their use for quantum chemistry applications. The latest GPUs fully support double precision arithmetic, with stunning performance in the range of several hundred GFLOPS, well beyond that of traditional processors (CPUs). Nevertheless, single precision continues to maintain between 2× and 8× more instruction units than double precision on the latest generation of GPUs. This disparity stems from the hardware's pedigree in graphics, where there is little need for double precision accuracy, and the necessary increase in circuitry is difficult to justify. Single precision may exhibit further performance advantages as a result of its smaller memory footprint, which reduces data bandwidth requirements 2 and increases the number of values that can be cached in registers. Thus, for maximum performance, it remains important to favor single precision arithmetic as much as possible on GPUs.
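The trade-off described above can be illustrated with a small CPU-side sketch. It is not GPU code and does not evaluate any integrals. It only assumes IEEE 754 single and double precision formats and uses hypothetical helper names (`to_f32`, `sum_f32`, `sum_f64`) to show two things: a single precision value occupies half the storage of a double, and a long sum kept in a single precision accumulator drifts much further from the exact result than the same sum kept in a double precision accumulator.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (IEEE double) to the nearest single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Storage per value: single precision needs half the bytes of double precision.
BYTES_F32 = struct.calcsize("<f")  # 4
BYTES_F64 = struct.calcsize("<d")  # 8


def sum_f32(terms):
    """Accumulate in single precision: round the running sum after every add."""
    acc = 0.0
    for t in terms:
        acc = to_f32(acc + t)
    return acc


def sum_f64(terms):
    """Accumulate the same terms in a double precision running sum."""
    acc = 0.0
    for t in terms:
        acc += t
    return acc


# Toy data (an assumption for illustration): many identical single precision terms.
N_TERMS = 100_000
term = to_f32(0.1)
terms = [term] * N_TERMS

# Reference value: the terms are identical, so one multiplication gives the sum.
exact = N_TERMS * term
err32 = abs(sum_f32(terms) - exact)
err64 = abs(sum_f64(terms) - exact)

if __name__ == "__main__":
    print(f"bytes per value: single={BYTES_F32}, double={BYTES_F64}")
    print(f"accumulation error: single={err32:.3e}, double={err64:.3e}")
```

In this toy setting the inputs are identical in both sums. Only the precision of the accumulator changes, which isolates the rounding cost of working purely in single precision.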
To balance GPU performance with chemical accuracy, quantum chemistry implementations have adopted mixed precision approaches in which double precision operations are added sparingly to an otherwise single precision calculation. Matrix multiplication in the context of resolution-of-the-identity Møller–Plesset perturbation theory has been shown to provide accurate mixed precision results, even when the majority of operations are carried out in single precision. 4,6 Single precision ERI evaluation has been successfully augmented with double precision accumulation into the matrix elements of the Coulomb and exchange operators. 3,5 "Double precision accumulation" simply means that the ERIs are evaluated in single precision, but a double precision variable is used to accumulate the products of density matrix elements and ERIs which make up the final operator (e.g., Coulomb or exchange). For example, the Coulomb operator can be constructed as

$$J^{64}_{\mu\nu} \mathrel{+}= P^{32}_{\lambda\sigma} \, (\mu\nu|\lambda\sigma)^{32} \tag{1}$$

where the superscripts indicate the number of bits of precision used for the labeled variable, and the ERIs are given as

$$(\mu\nu|\lambda\sigma) = \int \frac{\varphi_{\mu}(r_1)\, \varphi_{\nu}(r_1)\, \varphi_{\lambda}(r_2)\, \varphi_{\sigma}(r_2)}{|r_1 - r_2|} \, dr_1 \, dr_2 \tag{2}$$

Computing a few of the largest ERIs in full double precision has also been shown 5 to improve accuracy compared to calculations using only single precision for all ERIs. Incremental construction of the Fock matrix 17 has been noted to improve the accuracy of
Received: December 6, 2010