Petascale direct numerical simulation of blood flow on 200K cores and heterogeneous architectures

Abtin Rahimian∗, Ilya Lashuk∗, Shravan K. Veerapaneni†, Aparna Chandramowlishwaran∗, Dhairya Malhotra∗, Logan Moon∗, Rahul Sampath‡, Aashay Shringarpure∗, Jeffrey Vetter‡, Richard Vuduc∗, Denis Zorin†, and George Biros∗

∗Georgia Institute of Technology, Atlanta, GA 30332
†New York University, New York, NY 10002
‡Oak Ridge National Laboratory, Oak Ridge, TN 37831

rahimian@gatech.edu, ilashuk@cc.gatech.edu, shravan@cims.nyu.edu, aparna@cc.gatech.edu, dhairya.malhotra88@gmail.com, lmoon3@gatech.edu, rahul.sampath@gmail.com, aashay.shringarpure@gmail.com, vetter@computer.org, richie@cc.gatech.edu, dzorin@cs.nyu.edu, gbiros@acm.org

Abstract—We present a fast, petaflop-scalable algorithm for Stokesian particulate flows. Our goal is the direct simulation of blood, which we model as a mixture of a Stokesian fluid (plasma) and red blood cells (RBCs). Directly simulating blood is a challenging multiscale, multiphysics problem. We report simulations with up to 200 million deformable RBCs; the largest simulation amounts to 90 billion unknowns in space. In terms of the number of cells, we improve the state of the art by several orders of magnitude: the previous largest simulation, at the same physical fidelity as ours, resolved the flow of O(1,000-10,000) RBCs. Our approach has three distinct characteristics: (1) we faithfully represent the physics of RBCs by using nonlinear solid mechanics to capture the deformations of each cell; (2) we accurately resolve the long-range, N-body, hydrodynamic interactions between RBCs, which are mediated by the surrounding plasma; and (3) we allow for a highly non-uniform distribution of RBCs in space. The new method has been implemented in the software library MOBO (for "Moving Boundaries"). We designed MOBO to support parallelism at all levels, including inter-node distributed-memory parallelism, intra-node shared-memory parallelism, data parallelism (vectorization), and fine-grained multithreading for GPUs. We have implemented and optimized the majority of the computation kernels on both Intel/AMD x86 and NVIDIA Tesla/Fermi platforms, in both single and double precision. Overall, the code has scaled to 256 CPU-GPUs on the TeraGrid's Lincoln cluster and to 200,000 AMD cores of the Oak Ridge National Laboratory's Jaguar PF system. In our largest simulation, we achieved 0.7 Petaflop/s of sustained performance on Jaguar.

I. INTRODUCTION

Clinical needs in thrombosis risk assessment, anti-coagulation therapy, and stroke research would benefit significantly from an improved understanding of the microcirculation of blood. Toward this end, we present a new computational infrastructure, MOBO, that enables the direct numerical simulation of several microliters of blood at new levels of physical fidelity (Figure 1). MOBO consists of two key algorithmic components: (1) scalable integral-equation solvers for Stokesian flows with dynamic interfaces; and (2) scalable fast multipole algorithms. In terms of size alone, MOBO's overall simulation capability represents an advance of orders of magnitude beyond what was achieved in prior work on blood flow simulation.

We use an algorithmically optimal semi-implicit time-stepping scheme. Unlike explicit schemes, which require only near-neighbor communication but are algorithmically suboptimal for our problem, our solver requires global communication at every time step. Consequently, our solver is much more challenging to scale than an explicit one. Nevertheless, our results demonstrate that implicit solvers can be scaled successfully to hundreds of thousands of cores.
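To make this contrast concrete, here is a minimal sketch of the idea in simplified notation (it is illustrative, not the precise scheme). Let X^n denote the discretized positions of the RBC interfaces at time step n and let v(X) be the interfacial velocity obtained from the boundary-integral formulation. An explicit scheme advances

    X^{n+1} = X^n + Δt v(X^n),

which needs only near-neighbor communication per step, but whose stable time step Δt is severely restricted by the stiff membrane (bending) forces. A semi-implicit scheme treats the stiff contribution implicitly,

    X^{n+1} = X^n + Δt [ v_nonstiff(X^n) + v_stiff(X^{n+1}) ],

so each step requires the solution of a globally coupled linear system, and hence global communication, but admits a much larger stable time step.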
Challenges in Direct Numerical Simulation of Blood. Just one microliter of blood from a healthy individual contains approximately four million RBCs. The surrounding plasma, a viscous fluid, mechanically couples every RBC to every other RBC. Furthermore, RBCs are highly deformable; it is this deformability that determines the rheological properties of blood. The large number of cells and their complex local and global interactions pose significant challenges in designing tools for high-fidelity, scalable numerical simulations of blood. For example, the largest simulations to date have scaled to 1,200 cells [40] (using boundary integral equations, as we do) and 14,000 cells [8] (using lattice Boltzmann methods). The latter work, which is based on an explicit time-stepping scheme, scaled to 64K cores but requires an excessive number of time steps because of the non-physical stiffness introduced by the numerical scheme. Other methods, which model RBCs as rigid bodies, have scaled to an even larger number of RBCs, but they are crude approximations of blood flow. Deformable models of RBCs (Figure 1) are critical for accurate blood flow simulations.¹

Our Approach. We model RBCs as deformable viscous

¹For example, a 65% volume-fraction suspension of rigid spheres cannot flow, whereas blood flows even when the volume fraction of RBCs in the plasma reaches 95% [2].
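To make the plasma-mediated, all-to-all coupling concrete, the sketch below (plain C++ written for illustration; the names are ours and not MOBO's API) evaluates the free-space Stokes single-layer ("Stokeslet") interaction by direct summation. Because the kernel decays only like 1/r, the sum cannot be truncated to near neighbors; the fast multipole method reduces this O(N^2) evaluation to O(N) work, and loops of this form are representative of the computation kernels that are vectorized on x86 and offloaded to GPUs.

// Illustrative sketch (not MOBO's API): direct O(N^2) evaluation of the
// free-space Stokeslet interaction,
//   u(x_i) = 1/(8*pi*mu) * sum_{j != i} [ f_j / r + (f_j . r) r / r^3 ],  r = x_i - x_j,
// where f_j are force densities carried by the source points.
#include <cstddef>
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

void stokeslet_direct(const std::vector<Vec3>& src,    // source points
                      const std::vector<Vec3>& den,    // force densities at the sources
                      const std::vector<Vec3>& trg,    // target points
                      std::vector<Vec3>& vel,          // output: velocity at each target
                      double mu)                       // fluid (plasma) viscosity
{
  const double kPi = 3.14159265358979323846;
  const double scale = 1.0 / (8.0 * kPi * mu);
  vel.assign(trg.size(), Vec3{0.0, 0.0, 0.0});
  for (std::size_t i = 0; i < trg.size(); ++i) {
    Vec3 u{0.0, 0.0, 0.0};
    for (std::size_t j = 0; j < src.size(); ++j) {
      const double rx = trg[i].x - src[j].x;
      const double ry = trg[i].y - src[j].y;
      const double rz = trg[i].z - src[j].z;
      const double r2 = rx * rx + ry * ry + rz * rz;
      if (r2 == 0.0) continue;                        // skip the singular self-term
      const double inv_r = 1.0 / std::sqrt(r2);
      const double inv_r3 = inv_r * inv_r * inv_r;
      const double fdotr = den[j].x * rx + den[j].y * ry + den[j].z * rz;
      u.x += inv_r * den[j].x + inv_r3 * fdotr * rx;
      u.y += inv_r * den[j].y + inv_r3 * fdotr * ry;
      u.z += inv_r * den[j].z + inv_r3 * fdotr * rz;
    }
    vel[i] = Vec3{scale * u.x, scale * u.y, scale * u.z};
  }
}

In an actual boundary-integral solver the singular self- and near-interactions require specialized quadrature rather than being skipped; the sketch only conveys the structure and cost of the far-field sum.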