A real-time multigrid finite hexahedra method for elasticity simulation using CUDA

Christian Dick*, Joachim Georgii, Rüdiger Westermann
Computer Graphics and Visualization Group, Technische Universität München, Germany

Article history: Received 2 August 2010; Received in revised form 8 November 2010; Accepted 9 November 2010; Available online 21 November 2010

Keywords: Elasticity simulation; Deformable objects; Finite element methods; Multigrid; GPU; CUDA

Abstract

We present a multigrid approach for simulating elastic deformable objects in real time on recent NVIDIA GPU architectures. To accurately simulate large deformations we consider the co-rotated strain formulation. Our method is based on a finite element discretization of the deformable object using hexahedra, and it draws upon recent work on multigrid schemes for the efficient numerical solution of partial differential equations on such discretizations. Due to the regular shape of the numerical stencil induced by the hexahedral discretization, and since we use matrix-free formulations of all multigrid steps, computations and data layout can be restructured to avoid execution divergence of parallel running threads and to enable coalescing of memory accesses into single memory transactions. This makes it possible to effectively exploit the GPU's parallel processing units and high memory bandwidth via the CUDA parallel programming API. We demonstrate performance gains of up to a factor of 27 and 4 compared to a highly optimized CPU implementation on a single CPU core and 8 CPU cores, respectively. For hexahedral models consisting of as many as 269,000 elements, our approach achieves physics-based simulation at 11 time steps per second.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Over the last few years, graphics processing units (GPUs) have shown a substantial performance increase on intrinsically parallel computations.
Key to this evolution is the GPU's design for massively parallel tasks, with the emphasis on maximizing the total throughput of all parallel units. The ability to simultaneously use many processing units and to exploit thread-level parallelism to hide latency has led to impressive performance increases in a number of scientific applications. One prominent example is NVIDIA's Fermi GPU [1], on which we have based our current developments. It consists of 15 multiprocessors, on each of which several hundred co-resident threads can execute integer as well as single and double precision floating point operations. Double precision operations run at 1/2 of the speed of single precision operations. Each multiprocessor is equipped with a register file that is partitioned among the threads residing on the multiprocessor, as well as a small low-latency on-chip memory block which can be randomly accessed by these threads. Threads are further provided with direct read/write access to global off-chip video memory; these accesses are cached using a two-level cache hierarchy. The threads on each multiprocessor are executed in groups of 32 called warps, and all threads within one warp run in lock-step. For this reason, the GPU works most efficiently if all threads within one warp follow the same execution path. Automatic hardware multithreading is used to schedule warps in such a way as to hide the latency caused by memory access operations. Switching between warps comes at virtually no cost, since threads are permanently resident on a multiprocessor.

Simulation Modelling Practice and Theory 19 (2011) 801–816. doi:10.1016/j.simpat.2010.11.005. © 2010 Elsevier B.V. All rights reserved.

* Corresponding author. E-mail addresses: dick@tum.de (C. Dick), georgii@tum.de (J. Georgii), westermann@tum.de (R. Westermann).
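The warp execution model and memory coalescing behavior described in the introduction can be illustrated by a minimal CUDA sketch. This is not the authors' implementation; the kernel names, the `Vertex` struct, and the AXPY-style update are hypothetical stand-ins chosen only to contrast a coalesced structure-of-arrays access pattern with a strided array-of-structs one.

```cuda
#include <cuda_runtime.h>

// Structure-of-arrays layout: thread i reads x[i], so the 32 threads of a
// warp touch 32 consecutive 4-byte words, which the hardware coalesces
// into a single memory transaction.
__global__ void axpyCoalesced(int n, float a,
                              const float* __restrict__ x,
                              float* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // uniform branch: the whole warp takes the
        y[i] = a * x[i] + y[i]; // same path, so no execution divergence
}

// Anti-pattern for comparison: an array-of-structs layout makes thread i
// read v[i].x with a stride of sizeof(Vertex) bytes, splitting each warp's
// access into many separate memory transactions.
struct Vertex { float x, y, z; };

__global__ void axpyStrided(int n, float a, const Vertex* v, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * v[i].x + y[i];  // stride-3 access, poorly coalesced
}
```

A typical launch would use a block size that is a multiple of the warp size, e.g. `axpyCoalesced<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);`, so that every warp except possibly the last is fully populated.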