Dense Symmetric Indefinite Factorization on GPU Accelerated Architectures

Marc Baboulin¹, Jack Dongarra², Adrien Rémy¹, Stanimire Tomov², and Ichitaro Yamazaki²

¹ Université Paris-Sud and Inria, France
[baboulin,aremy]@lri.fr
² University of Tennessee, Knoxville, USA
[dongarra,tomov,iyamazak]@eecs.utk.edu

Abstract. We study the performance of dense symmetric indefinite factorizations (Bunch-Kaufman and Aasen's algorithms) on multicore CPUs with a Graphics Processing Unit (GPU). Though such algorithms are needed in many scientific and engineering simulations, obtaining high performance of the factorization on the GPU is difficult because the pivoting required to ensure the numerical stability of the factorization leads to frequent synchronizations and irregular data accesses. As a result, until recently, there has not been any implementation of these algorithms on hybrid CPU/GPU architectures. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive communication and synchronization between the CPU and GPU, or on the GPU. We also study the performance of an $LDL^T$ factorization with no pivoting combined with a preprocessing technique based on Random Butterfly Transformations. Although these transformations provide only probabilistic guarantees of numerical stability, they avoid pivoting and achieve high performance on the GPU.

Keywords: dense symmetric indefinite factorization, communication-avoiding, randomization, GPU computation.

1 Introduction

A symmetric matrix $A$ is called indefinite when its quadratic form $x^T A x$ can take both positive and negative values. Dense linear systems of equations with symmetric indefinite matrices appear in many areas of physics, including structural mechanics, acoustics, and electromagnetism. For instance, such systems arise when a linear least-squares problem is solved through an augmented system [15, p.
77], or in electromagnetism, where discretization by the Boundary Element Method results in linear systems with dense complex symmetric (non-Hermitian) matrices [21]. The efficient solution of these linear systems demands a high-performance implementation of a dense symmetric indefinite solver that can efficiently use the current hardware architecture. In particular, the use of accelerators has become pervasive in scientific computing due to their high-performance