Parallel GMRES implementation for solving sparse linear systems on GPU clusters

Jacques M. Bahi, Raphaël Couturier, Lilia Ziane Khodja
University of Franche-Comte, LIFC laboratory, Rue Engel-Gros, BP 527, 90016 Belfort Cedex, France
{jacques.bahi, raphael.couturier, lilia.ziane khoja}@univ-fcomte.fr

Keywords: sparse linear systems, GMRES, CUDA, GPU cluster

Abstract

In this paper, we propose an efficient parallel implementation of the GMRES method for GPU clusters. This implementation requires us to parallelize the GMRES algorithm between the CPUs of the cluster. All parallel and intensive computations on local data are performed on the GPUs, while the reduction operations that compute global results are carried out by the CPUs. The performance of our parallel GMRES solver is evaluated on test matrices of sizes exceeding 10^7 rows. The results show that solving large sparse linear systems on a GPU cluster is faster than on its CPU counterpart: a cluster of 12 GPUs is about 8 times faster than a cluster of 12 CPUs and about 5 times faster than a cluster of 24 CPUs.

1. INTRODUCTION

In numerous scientific and industrial applications, solving large sparse linear systems is often the most costly step of the computing process, in both CPU time and memory space. Solving such systems tends to be slowed down by the large number of unknowns and by the irregular memory accesses to the small number of non-zero values of their matrices. Thus, any reduction in the solving time of these systems results in a significant saving in the total numerical computation time.

For the past few years, modern graphics processing units (GPUs) have become attractive tools providing the high computing power and the large memory bandwidth required for sparse linear system solutions [6], [12].
Although initially designed to carry out the intensive computations of graphics applications, GPUs have rapidly evolved into high performance accelerators for data-parallel tasks and intensive arithmetic computations. To exploit the computational power of this architecture, Nvidia has released the CUDA platform (Compute Unified Device Architecture) [14], which provides a high level programming language for GPGPU (General-Purpose computing on GPUs), allowing GPUs to be programmed not only for graphics applications but also for the general purpose computations of non-graphics applications.

Nowadays, the relevant parallel architectures exploiting the high performance of GPUs are clusters equipped with GPUs. They have become very attractive for high performance computing, given their low cost compared to their computational power and their ability to compute faster and consume less energy than their pure CPU counterparts [2]. GPU clusters have already been used to accelerate the numerical computations of sparse linear solvers, as in [3], [7], where the authors parallelized the Conjugate Gradient (CG) solver on several GPU platforms and under different parameters affecting its performance. For example, Cevahir et al. [7] presented a parallel implementation of a CG solver using hypergraph partitioning on CPU and GPU clusters. They show that a double precision CG solution with 32 GPUs on 16 TSUBAME nodes is 17.4 times faster than the CPU implementation on the same number of nodes and CPU cores. However, we have noticed in the literature that most parallel implementations of sparse linear solvers were performed on small sparse linear systems not exceeding 6·10^6 unknowns. Therefore, our purpose in this paper is to solve more effectively large sparse linear systems whose sizes exceed 10^7 unknowns, and to study the effect of these matrix sizes on the performance of our proposed sparse linear solver.
The target sparse linear solver that we explore on a GPU cluster is a GMRES (Generalized Minimal RESidual) solver, since it is a generic and efficient method for solving linear systems. It gives good performance in most cases regardless of the properties of the matrices associated with the linear systems (sparsity, symmetry, number of unknowns, type of values, ...). Moreover, it is an iterative method, which is well suited to linear systems of very large order. Indeed, an iterative method computes a sequence of approximate solutions converging to the exact solution. In contrast, a direct method determines the exact solution after a finite number of operations, which leads to expensive consumption of computation time and memory space, and is therefore not suited to large linear systems.

GMRES parallelization has already been attempted on various parallel platforms using different approaches. In particular, parallel GMRES algorithms have been implemented on distributed memory architectures for large linear systems. For instance, Dias et al. [10] presented a parallel implementation of the restarted GMRES under the PVM message passing system on a MEiKO SPARC-based Computing Surface, and Couturier et al. [9] presented a parallel GMRES implementation on Grid’5000 using the Java language and the MPJ library for com-