Parallel Preconditioning for Spherical Harmonics Expansions of the Boltzmann Transport Equation Karl Rupp *† , Tibor Grasser * and Ansgar J¨ ungel * Institute for Microelectronics, TU Wien. Gußhausstraße 27-29/E360, A-1040 Wien, Austria Institute for Analysis and Scientific Computing, TU Wien. Wiedner Hauptstraße 8-10/E101, A-1040 Wien, Austria Email: {rupp,grasser}@iue.tuwien.ac.at, juengel@asc.tuwien.ac.at Abstract—While the Monte Carlo method for the Boltzmann transport equation for semiconductors has already been paral- lelized, this is much more difficult to accomplish for the deter- ministic spherical harmonics expansion method which requires the solution of a linear system of equations. For the typically employed iterative solvers, preconditioners are required to obtain good convergence rates. These preconditioners are serial in nature and cannot be applied efficiently in a black-box manner to arbitrary systems. Motivated by the underlying physical processes, we present a parallel block-preconditioning scheme that allows us to use existing serial preconditioners in a parallel setting. A reduction of execution times by up to one order of magnitude on current multi-core processors as well as graphics processing units is observed. I. I NTRODUCTION Since its introduction in the early 1990s, the spherical harmonics expansion (SHE) method has become an attractive alternative to the stochastic Monte Carlo method for the numerical solution of the Boltzmann transport equation (BTE). While the application of the SHE method has long been restricted to one-dimensional device simulations due to high memory requirements, enough memory is available on modern computers to allow for two-dimensional device simulations [1]. With the emergence of parallel computing architectures in desktop computers, parallel algorithms increase in attractive- ness. Recently, massively parallel computing architectures in the form of graphics processing units (GPUs) for the use as accelerators have gained a lot of popularity. While fully parallel implementations of the Monte Carlo method have already been reported [2], an artificial restriction of the SHE method to a single CPU core would be detrimental to the attractiveness of the method. The SHE method ultimately leads to the solution of large systems of linear equations, typically employed within a nonlinear iteration scheme to ensure self-consistency of the BTE with the Poisson equation. Due to the large number of unknowns, iterative solution methods have to be used for the solution of these systems. The convergence rate of such iterative methods can be substantially improved by the use of preconditioners. As discussed by Jungemann et al. [3], the system of linear equations resulting from the SHE equations requires a good preconditioner in order to obtain reasonable rates. In recent publications on the SHE method [1], [3], an incomplete LU factorization (ILU) preconditioner was used for that purpose. ILU is a widely accepted black-box preconditioner [4], but it is in its pure form restricted to single- threaded execution. Even though parallel block-variants of ILU as well other parallel preconditioning techniques such as sparse approximate inverses [5] or polynomial preconditioners have been developed, their convergence enhancement can be typically considerably lower than single-threaded variants [4], [6]. The purpose of this work is to show that the preconditioner for the SHE method can be well parallelized. We study the structure of the linear system resulting from the SHE method and propose a general block-preconditioning scheme which can also be used with serial preconditioners. We demonstrate a considerable reduction of execution times on multi-core CPUs as well as on GPUs by employing the initially serial ILU preconditioner within the proposed parallel framework. II. PHYSICS-BASED BLOCK-PRECONDITIONING In operator form, the SHE equations in steady state can be written as L l,m {f } = Q l,m {f } , l =0,...,L, m = -l,...,l, where L l,m and Q l,m denote the projections of the stream- ing operator and the scattering operator onto the spherical harmonics Y l,m , respectively. Employing the H -transform [1], [7], carrier trajectories in free flight are given by hyperplanes of constant total energy H in the simulation domain (x,H ), cf. Fig. 1. This is reflected in the model by the fact that L l,m does not couple any of the, say, N H different energy levels in the simulation domain. Carriers within the device can change their total energy only by inelastic scattering events, thus the scattering operator Q l,m {f } is responsible for coupling different energy levels. However, if only elastic scattering processes are considered, the total energy of the involved particles remains unchanged and the different energy levels do not couple. Therefore, in a SHE simulation using only elastic scattering and N H different energy levels, the resulting system of linear equations is consequently decoupled into N H independent problems. Such a decomposition has been observed already in early publications on SHE [8], but it has been of no practical relevance since inelastic scattering processes are essential for predictive device simulation.