Hybrid CPU-GPU acceleration of the 3-D parallel code SPH-Flow

G. Oger, E. Jacquin, M. Doring, P.-M. Guilcher
HydrOcean, Nantes, France
guillaume.oger@hydrocean.fr

R. Dolbeau, P.-L. Cabelguen, L. Bertaux
CAPS Entreprise, Rennes, France

D. Le Touzé, B. Alessandrini
Laboratoire de Mécanique des Fluides, Ecole Centrale Nantes, Nantes, France

Abstract—The SPH-Flow software, developed jointly by the Fluid Mechanics Laboratory of Ecole Centrale Nantes and HydrOcean, is now capable of modelling complex free-surface flows with complex 3-D geometries, fluid-structure interactions, and multi-fluid configurations [6][7]. Development of the code continues today within a consortium of several academic and industrial partners. This 3-D parallel software, based on MPI, is clearly oriented towards massively parallel computing on machines fitted with several hundred CPU nodes in distributed-memory architectures. Thanks to its current implementation, the code can process industrial applications involving millions of particles [4][5][8]. The stated aim is to push the current limits of SPH calculations further, bypassing the inherent limits of the method (small time steps, time-consuming particle-to-particle interaction loops, etc.). As such, the code must evolve to benefit from new hardware architectures available on the market. Today, the use of GPU devices to speed up scientific calculations is widely documented and represents a real evolution in computational technologies. Several test studies on GPU translations of SPH codes have been conducted recently, mainly within the SPHERIC community, with very promising results [2][3]. More generally, other fields in which scattered data must be computed show great interest in GPUs (see for example [1] in molecular dynamics).
However, GPU computing requires a specialized adaptation of the code, both in terms of language (CUDA or others) and of the algorithms implemented (memory access patterns and data contiguity in the memory banks). This paper presents a first attempt at CPU-GPU hybridization of SPH-Flow. The original MPI-based distributed-memory parallelization is preserved, allowing massive calculations on several CPU and GPU devices.

I. INTRODUCTION

The work presented in this paper is a joint effort between HydrOcean and CAPS Entreprise. The latter company develops the "Hybrid Multicore Parallel Programming" (HMPP) technology, a directive-based compiler environment dedicated to building parallel GPU-accelerated applications. It targets NVIDIA as well as AMD/ATI GPUs. As part of the promotion and development of high-performance computing in France and Europe, the French national high-performance computing agency GENCI (Grand Equipement National de Calcul Intensif) joined CAPS Entreprise in issuing a call for proposals to the scientific community for porting complex applications to hybrid graphics-accelerated systems. The main admission requirement was the prior existence of a parallelization for distributed-memory architectures (via MPI, for instance). SPH-Flow was selected in this call, and a first CPU-GPU hybridization has been achieved. For this first attempt, it was thus decided to retain the current parallel algorithm of SPH-Flow (based on Fortran90 + MPI) and to adapt the most time-consuming procedures to GPU computing. In this paper, the various questions raised by the CPU-GPU hybridization of a code initially programmed with MPI libraries are presented and discussed. Particular attention is given to the changes required to make the SPH algorithm, originally designed for parallel CPU execution, fit CPU-GPU hardware.
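To make the target of the hybridization concrete, the sketch below shows the kind of particle-to-particle interaction loop that dominates SPH runtime. This is an illustrative toy (a brute-force 2-D density accumulation in C), not SPH-Flow's actual Fortran90 kernel; real SPH codes use neighbour lists rather than the O(N²) scan shown here. The comment marks where a directive-based compiler would place its offload annotation; the OpenACC pragma is named only as a comparable directive-based model, since the exact HMPP directive syntax is not given in the paper.

```c
#include <math.h>

#define N 512

/* Toy pairwise interaction loop: each particle accumulates a
   kernel-weighted contribution from every neighbour within radius h.
   (Illustrative only: real SPH codes restrict j to a neighbour list.) */
void interact(const float *x, const float *y, float *density, float h)
{
    /* A directive-based compiler would offload this loop nest to the GPU
       via an annotation here, e.g. "#pragma acc parallel loop" in OpenACC;
       HMPP uses its own codelet/callsite directives with similar intent. */
    for (int i = 0; i < N; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < N; ++j) {
            float dx = x[i] - x[j];
            float dy = y[i] - y[j];
            float r2 = dx * dx + dy * dy;
            if (r2 < h * h)          /* inside the kernel support radius */
                sum += (h * h - r2); /* crude stand-in smoothing weight  */
        }
        density[i] = sum;
    }
}
```

The point of the directive-based approach is that this loop body stays in the host language; only the annotation (and the memory-layout constraints it implies) changes between the CPU and GPU builds.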
The main adaptation techniques rely on reworking the particle-to-particle interaction loops, and on Hilbert space-filling curves used to sort the particle data. The strategy chosen for the CPU-GPU translation of a code originally designed for parallel CPU architectures is then presented. Finally, the speedup obtained is presented and discussed on up to 32 hybrid CPU-GPU compute nodes.
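The Hilbert-curve sort mentioned above can be sketched as follows. This is a minimal C illustration of the general idea (assign each particle the index of its grid cell along a Hilbert space-filling curve, then sort by that key so spatially close particles become contiguous in memory), using the standard 2-D bit-manipulation algorithm for the Hilbert index; SPH-Flow itself is a 3-D Fortran90 code, and the `Particle` struct and `hilbert_sort` helper here are hypothetical names, not the paper's actual data structures.

```c
#include <stdlib.h>

/* Map (x, y) on an n-by-n grid (n a power of two) to its index along
   the Hilbert space-filling curve (standard bit-manipulation form). */
static unsigned xy2d(unsigned n, unsigned x, unsigned y)
{
    unsigned rx, ry, d = 0;
    for (unsigned s = n / 2; s > 0; s /= 2) {
        rx = (x & s) > 0;
        ry = (y & s) > 0;
        d += s * s * ((3 * rx) ^ ry);
        if (ry == 0) {                  /* rotate the quadrant */
            if (rx == 1) { x = s - 1 - x; y = s - 1 - y; }
            unsigned t = x; x = y; y = t;
        }
    }
    return d;
}

/* Hypothetical particle record: position plus its Hilbert sort key. */
typedef struct { float px, py; unsigned key; } Particle;

static int by_key(const void *a, const void *b)
{
    unsigned ka = ((const Particle *)a)->key;
    unsigned kb = ((const Particle *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Bin each particle into a grid cell of size `cell`, compute its
   Hilbert key, and sort: afterwards, particles that are close in
   space are (mostly) contiguous in memory, improving GPU coalescing. */
void hilbert_sort(Particle *p, int count, unsigned grid, float cell)
{
    for (int i = 0; i < count; ++i)
        p[i].key = xy2d(grid, (unsigned)(p[i].px / cell),
                              (unsigned)(p[i].py / cell));
    qsort(p, count, sizeof(Particle), by_key);
}
```

The payoff on GPU hardware is memory locality: after the sort, the neighbours visited in the interaction loops tend to sit in nearby memory locations, which is exactly the contiguous-access pattern GPU memory banks favour.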