Scalable multi-GPU implementation of the MAGFLOW simulator

Eugenio Rustico 1,*, Giuseppe Bilotta 1,2, Alexis Hérault 2,3, Ciro Del Negro 2, Giovanni Gallo 1

1 Università di Catania, Dipartimento di Matematica e Informatica, Catania, Italy
2 Istituto Nazionale di Geofisica e Vulcanologia, Sezione di Catania, Osservatorio Etneo, Catania, Italy
3 Conservatoire des Arts et Métiers, Département Ingénierie Mathématique, Paris, France

ANNALS OF GEOPHYSICS, 54, 5, 2011; doi: 10.4401/ag-5342

ABSTRACT

We have developed a robust and scalable multi-GPU (Graphics Processing Unit) version of the cellular-automaton-based MAGFLOW lava simulator. The cellular automaton is partitioned into strips that are assigned to different GPUs, with minimal overlapping. For each GPU, a host thread is launched to manage allocation, deallocation, data transfer and kernel launches; the main host thread coordinates all of the GPUs, to ensure temporal coherence and data integrity. The overlapping borders and the maximum temporal step need to be exchanged among the GPUs at the beginning of every evolution of the cellular automaton; data transfers are asynchronous with respect to the computations, to cover the introduced overhead. The GPUs are not required to have the same speed or capacity; the system runs flawlessly on both homogeneous and heterogeneous hardware. The speed-up factor differs from the ideal one (#GPUs×) only by a constant overhead loss of about 4×10^-2 · T · #GPUs, with T as the total simulation time.

1. Introduction

Numerical problems that expose a high degree of intrinsic parallelism can often be mapped to more than one level of parallelism. If a problem can be modeled as a set of pseudo-independent subproblems, then the set of computations needed to solve it can be split into two or more partitions that can be executed separately.
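As a minimal illustration of the strip partitioning mentioned in the abstract, the sketch below splits the rows of an automaton grid into contiguous strips, one per device, with strip sizes proportional to each device's relative speed (so heterogeneous hardware can be accommodated). The function name, the weight-based scheme, and the rounding policy are all illustrative assumptions, not the paper's actual load-balancing code.

```python
def partition_rows(n_rows, weights):
    """Split n_rows of a cellular automaton into contiguous strips,
    one per device, proportionally to each device's relative speed.
    Returns a list of (start, end) half-open row ranges.
    Illustrative sketch only; not the MAGFLOW implementation."""
    total = sum(weights)
    bounds, start = [], 0
    for i, w in enumerate(weights):
        # the last strip absorbs any rounding remainder
        size = n_rows - start if i == len(weights) - 1 else round(n_rows * w / total)
        bounds.append((start, start + size))
        start += size
    return bounds

# two equally fast GPUs and one half-speed GPU sharing 1000 rows
print(partition_rows(1000, [1.0, 1.0, 0.5]))
# → [(0, 400), (400, 800), (800, 1000)]
```

In a real multi-GPU run, each range would then be extended by the overlapping border rows needed to read the neighboring strip's cells.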
A few numerical problems, such as astrophysical n-body simulations, have completely interdependent subproblems; sometimes it is nevertheless possible to parallelize such problems by means of numerical approximations or by adding specific constraints to the model. Cellular automaton methods are by definition made up of quasi-independent subproblems: the state of every cell of the automaton depends only on the previous state of the same cell and on the states of its neighbor cells. Given a cell and its neighborhood, we can compute its next state in parallel and with almost no knowledge of the other cells. The same reasoning applies to sets of cells that are in turn independent of other sets, except for their neighborhoods.

In the general case, it is possible to exploit a two-level parallelism and to split an already parallel cellular automaton simulation into parts to be executed on different devices, which can include a graphics processing unit (GPU) card, a central processing unit (CPU) core, the node of a cluster, or even a computer in a network. The actual feasibility and benefit of this theoretical possibility must, however, be evaluated case by case, as there may be model peculiarities that require specific adaptations or that conflict with the technical limitations of the underlying architecture. The most limiting factors are typically the latency and/or the bandwidth of the inter-device communication channel, which is usually orders of magnitude slower than any intra-device communication.

The MAGFLOW lava simulation model [Vicari et al. 2007] is the cellular automaton developed by the Sezione di Catania of the Istituto Nazionale di Geofisica e Vulcanologia (INGV), and it represents the peak of the evolution of cell-based models for lava-flow simulations. The conversion of the MAGFLOW cellular automaton from serial single-CPU code to a parallel, GPU-based implementation is discussed in Bilotta et al.
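The locality property described above can be made concrete with a toy synchronous update: every cell's new state is computed purely from the previous grid, using only the cell itself and its von Neumann neighbors, so each cell (or block of cells) can be updated independently and in parallel. The diffusion-like averaging rule below is a deliberately simple stand-in, not the MAGFLOW physics.

```python
def step(grid):
    """One synchronous update of a toy cellular automaton.
    Each cell's new state depends only on its own previous state and on
    the previous states of its von Neumann neighbors, so every cell can
    be updated in parallel. The averaging rule is illustrative only."""
    n, m = len(grid), len(grid[0])
    new = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc, cnt = grid[i][j], 1
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < n and 0 <= nj < m:
                    acc += grid[ni][nj]   # read-only access to the OLD grid
                    cnt += 1
            new[i][j] = acc / cnt         # writes go to a separate NEW grid
    return new
```

Because all reads target the old grid and all writes target a new one, the double loop can be distributed over any number of devices without synchronization during the step itself; only the neighborhood values along the cut must be shared.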
[2011] in this volume; here, we show how we exploited a second level of parallelism by running a simulation on two or more GPUs simultaneously.

There are at least two ways to exploit a two-level parallelism in the evolution of a cellular automaton. The simplest and most intuitive is to spatially partition the automaton into disjoint areas and to assign each area to a separate device. As we need to operate locally and read information about the neighboring cells, these areas can never be truly disjoint; it is therefore more correct to refer to them as subsets rather than partitions. We could instead divide the domain of computations, thus realizing a pipeline: each device is assigned all of the cells in the domain, but performs only a specific phase of the evolution. Unfortunately, this approach presents three major drawbacks:

Article history: received November 23, 2010; accepted July 15, 2011.
Subject classification: Computational geophysics, General purpose GPU modeling, HPC, Parallel programming, GPU, Multi-GPU, Hazard, Lava.
Special Issue: V3-LAVA PROJECT
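The overlapping-subset idea can be sketched as a border (ghost-row) exchange between adjacent strips: after each evolution step, every strip sends its outermost computed rows to its neighbors, which store them as read-only ghost rows for the next step. The data layout and function name below are illustrative assumptions; in the actual multi-GPU simulator these copies are inter-device transfers performed asynchronously with respect to the computations.

```python
def exchange_borders(strips, halo=1):
    """Exchange ghost rows between adjacent strips of an automaton.
    Each 'strip' is a list of rows with `halo` ghost rows at its top and
    bottom edges facing a neighbor. Illustrative sketch: on real hardware
    these assignments become (asynchronous) inter-GPU memory transfers."""
    for k in range(len(strips) - 1):
        upper, lower = strips[k], strips[k + 1]
        # lower strip's top ghost rows <- upper strip's last computed rows
        lower[:halo] = upper[-2 * halo:-halo]
        # upper strip's bottom ghost rows <- lower strip's first computed rows
        upper[-halo:] = lower[halo:2 * halo]
    return strips
```

A usage example with two strips of a tiny grid, where `None` marks a stale ghost row: `exchange_borders([[[0], [1], [None]], [[None], [2], [3]]])` fills each `None` slot with the neighboring strip's adjacent computed row.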