from SIAM News, Volume 33, Number 6

APPLICATIONS ON ADVANCED ARCHITECTURE COMPUTERS
Greg Astfalk, Editor

Portable Memory Hierarchy Techniques For PDE Solvers: Part II

By Craig C. Douglas, Gundolf Haase, Jonathan Hu, Markus Kowarschik, Ulrich Rüde, and Christian Weiss

The first part of this article, which appeared in the June issue of SIAM News, detailed the architecture and behavior of microprocessor caches and made clear their impact on application performance. Here, in the second and concluding part, we apply the lessons of the previous article to application codes, and we quantify the performance benefits available to a user who thinks and codes in a "cache-aware" fashion.

Numerical Techniques for PDEs

Computational solution of PDEs involves discretization and linearization. The problem then becomes one of linear algebra: solving a linear system with several million unknowns. Such systems are too large to be solved by classical direct methods on any contemporary computer; direct methods are not competitive on these very large systems because they cannot efficiently exploit the sparsity of the system matrices. The methods of choice are all iterative. Instead of an elimination process, iterative methods use repeated multiplications of the solution vector by the system matrix. Even when a significant (but limited) number of iterations is required, an iterative method is usually less costly than a direct method.

The most basic iterative solvers are the Jacobi and Gauss–Seidel methods. Although not competitive with more advanced methods, they are useful either as models for more complex methods or as building blocks from which better methods can be constructed. The latter is particularly true of multigrid methods, which are the fastest known solvers for many elliptic PDEs.
In multigrid, the application of a method like Gauss–Seidel for several iterations (known as the smoother) is followed by the activation of coarser or finer grids, with additional smoothing steps. Repeated execution of a simple method like Gauss–Seidel accounts for most of the computational cost of a multigrid algorithm, commonly more than 80% for a simple multigrid solver.

Logically Tensor Product Grid-based Problems

Many PDE problems today are still solved on uniform, tensor product, or logically uniform grids. A problem may be part of a domain-decomposed grid with different mesh spacings in each of the domains. Researchers in many disciplines use grids of this type to attain high levels of performance, even though the number of vertices in the global problem can be greater than it would be with an adaptively chosen, (quasi)unstructured grid.

With a structured grid, it is natural to use a two- or three-dimensional array for storing the vector of unknowns. The system matrix can also be stored in a grid-oriented manner. If the PDE has constant coefficients, no matrix needs to be stored: all the entries corresponding to the same edge in the mesh are identical. In this case, only a few vectors (e.g., the solution, the right-hand side, and possibly the residual) have to be stored. In general, the memory needed is a small multiple of the size of the solution vector.

No matter how the data is stored, the fundamental memory access problem is a consequence of the way all iterative algorithms are designed. The core of these methods consists of repeated matrix–vector multiplications or simple variants thereof. In some cases (e.g., relaxation methods like Gauss–Seidel) the matrix–vector multiplication is hidden by being combined directly with an update of the solution vector. Generally, the full data set for an approximate solution must be read completely from memory for use by the processor, and it must then be written back to memory.
In the interesting cases, the vector and the matrix together are too large to fit in any of the caches. As the vector and matrix elements are accessed, some of the older elements are evicted from the cache. With an LRU (least recently used) replacement strategy, the first vector and matrix elements are no longer in cache when the sweep completes. Every sweep must therefore start from scratch and load everything from main memory; all or almost all of the data from the previous sweep will have been displaced from the cache. Temporal locality can be exploited only when the whole data set is small enough to fit into the cache.

Consider a standard implementation of a two-dimensional, red–black, Gauss–Seidel relaxation method based on a five-point discretization of the Laplace operator, as shown in Figure 1. The matrix is not stored, because we use a uniform mesh; the coefficients are typically kept in registers. The data that has to be loaded from main memory consists of the solution vector and the right-hand side vector. This problem is particularly difficult to optimize because so few variables are involved.

The runtime behavior of the standard red–black Gauss–Seidel program on a Digital PWS 500au is summarized in Table 1. For the smallest grid size, the floating-point performance is good when compared with the peak performance of 1 Gflops. Increasing