Exploiting thread-level parallelism in the iterative solution of sparse linear systems ☆

José I. Aliaga a,1, Matthias Bollhöfer b,2, Alberto F. Martín a,⇑,1, Enrique S. Quintana-Ortí a,1

a Dpto. de Ingeniería y Ciencia de Computadores, Universidad Jaime I, 12.071-Castellón, Spain
b Institut Computational Mathematics, TU Braunschweig, 38106 Braunschweig, Germany

Article info
Article history: Available online 25 November 2010
Keywords: Large sparse linear systems; Factorization-based preconditioning; Preconditioned conjugate gradients; Task-level parallelism; Shared-memory multiprocessors

Abstract
We investigate the efficient iterative solution of large-scale sparse linear systems on shared-memory multiprocessors. Our parallel approach is based on a multilevel ILU preconditioner which preserves the mathematical semantics of the sequential method in ILUPACK. We exploit the parallelism exposed by the task tree corresponding to the nested dissection hierarchy (task parallelism), employ dynamic scheduling of tasks to processors to improve load balance, and formulate all stages of the parallel PCG method conformal with the computation of the preconditioner to increase data reuse. Results on a CC-NUMA platform with 16 processors reveal the parallel efficiency of this solution.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

One of the numerical problems which arises most frequently in modern linear algebra is the solution of large sparse systems of equations. In applications involving the discretization of partial differential equations (PDEs), the efficient solution of sparse linear systems is one major computational task. The same key subproblem appears in many other application areas, such as quantum physics or circuit and device simulation. This is also the case for nonlinear equations as well as large-scale eigenvalue computations, since the iterative methods employed in those problems usually require the solution of multiple linear systems.
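To make the notion of factorization-based preconditioning for conjugate gradients more concrete, the following is a minimal sequential sketch in Python. It is not the multilevel ILU of ILUPACK and contains none of the parallelism studied in this paper: it swaps in a plain zero-fill incomplete Cholesky factorization, IC(0), on a small 2D Laplacian, and it uses dense lists only for brevity. All function names (laplacian_2d, ic0, apply_preconditioner, pcg) and the test problem are assumptions chosen for illustration.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def laplacian_2d(n):
    """Dense 5-point Laplacian on an n x n grid (symmetric positive definite)."""
    N = n * n
    A = [[0.0] * N for _ in range(N)]
    for i in range(n):
        for j in range(n):
            k = i * n + j
            A[k][k] = 4.0
            if i > 0:
                A[k][k - n] = -1.0
            if i < n - 1:
                A[k][k + n] = -1.0
            if j > 0:
                A[k][k - 1] = -1.0
            if j < n - 1:
                A[k][k + 1] = -1.0
    return A

def ic0(A):
    """Zero-fill incomplete Cholesky: L is nonzero only where tril(A) is."""
    N = len(A)
    L = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(i + 1):
            if A[i][j] == 0.0:
                continue  # discard any fill-in outside the pattern of A
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def apply_preconditioner(L, r):
    """Solve (L L^T) z = r by forward and backward substitution."""
    N = len(r)
    y = [0.0] * N
    for i in range(N):
        y[i] = (r[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    z = [0.0] * N
    for i in reversed(range(N)):
        z[i] = (y[i] - sum(L[k][i] * z[k] for k in range(i + 1, N))) / L[i][i]
    return z

def pcg(A, b, precond, tol=1e-8, max_iter=1000):
    """Preconditioned conjugate gradients; returns (solution, iterations)."""
    x = [0.0] * len(b)
    r = b[:]                      # residual for the zero initial guess
    z = precond(r)
    p = z[:]
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b))
    for it in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        z = precond(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

A = laplacian_2d(10)                           # 100 unknowns
b = [math.sin(k + 1.0) for k in range(len(A))] # arbitrary right-hand side
L = ic0(A)
x_plain, it_plain = pcg(A, b, lambda r: r[:])  # identity: unpreconditioned CG
x_ic, it_ic = pcg(A, b, lambda r: apply_preconditioner(L, r))
print("CG iterations:", it_plain, "| IC(0)-PCG iterations:", it_ic)
```

Running the script prints the iteration counts of both solves, so the effect of the approximate factorization on convergence can be compared directly. A practical code would store the matrix and its factor in a sparse format rather than as dense lists.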
As the size of the underlying applications increases (e.g., three spatial dimensions for PDEs or an increasing number of devices in integrated circuits), the development of fast and efficient numerical solution techniques becomes crucial. Sparse direct solvers have proven to be extremely efficient for a large class of application problems, but they are also known to perform poorly for others. For example, direct methods are highly efficient when applied to two-dimensional (2D) PDEs, but they slow down dramatically for three-dimensional (3D) problems because of considerable fill-in [1]. Approximate factorization techniques combined with Krylov subspace methods are an appealing alternative for this kind of application problem, where fill-in becomes an issue [2].

☆ This article originally was to appear in the special issue of papers from PMAA'08 (May/June 2010), but due to delays in reviewing and revision it appears here instead.
⇑ Corresponding author.
E-mail addresses: aliaga@icc.uji.es (J.I. Aliaga), m.bollhoefer@tu-bs.de (M. Bollhöfer), martina@icc.uji.es, dolbibolfo@hotmail.com (A.F. Martín), quintana@icc.uji.es (E.S. Quintana-Ortí).
1 Supported by A.I. Hispano-Alemanas HA2007-0071, CICYT project TIN2008-06570-C04-01, and the Fundación Caixa-Castelló Bancaixa/UJI project P1-1B2009-31.
2 Supported by DAAD grant D/07/13360.
Parallel Computing 37 (2011) 183–202
journal homepage: www.elsevier.com/locate/parco
0167-8191/$ - see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.parco.2010.11.002

In this paper we present a parallel variant of a highly efficient multilevel incomplete LU factorization (ILU) method which has been successfully applied to large-scale sparse application problems with up to millions of equations [3–5]. This