Run-Time Reference Clustering for Cache Performance Optimization

Wesley K. Kaplow, Boleslaw K. Szymanski, Peter Tannenbaum
Department of Computer Science
Scientific Computation Research Center
Rensselaer Polytechnic Institute, Troy, N.Y. 12180-3590, USA
{kaploww,szymansk,tannenp}@cs.rpi.edu

Viktor K. Decyk
Physics Department, UCLA, Los Angeles, CA. 90024, USA
and Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA. 91109, USA
vdecyk@pepper.physics.ucla.edu

Abstract

We introduce a method for improving the cache performance of irregular computations in which data are referenced through run-time defined indirection arrays. Such computations often arise in scientific problems. The presented method, called Run-Time Reference Clustering (RTRC), is a run-time analog of the compile-time blocking used for dense matrix problems. RTRC uses the data partitioning and re-mapping techniques that are part of distributed-memory multiprocessor codes designed to minimize interprocessor communication. Re-mapping each set of local data decreases cache misses in the same way that re-mapping the global data decreases off-processor references. We demonstrate the applicability and performance of the RTRC technique on several prevalent applications: Sparse Matrix-Vector Multiply, Particle-In-Cell, and CHARMM-like codes. Performance results on SPARC-20, SP-2, and T3D processors show that single-node execution performance can be improved by as much as 35%.

1. Introduction

One of the goals of parallel program optimization is to keep data references as low as possible in the memory hierarchy. However, this is a difficult task for irregular computations that make array references via indirect indices. The indices are data-dependent, and thus it is impossible to determine at compile time a static distribution of the data that will minimize inter-processor communication.
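To make the problem concrete, the following sketch (names hypothetical, not taken from the paper) shows a sparse matrix-vector multiply in compressed-row form, one of the application kernels considered later. The column-index array `col` is defined only at run time, so the compiler cannot derive a cache-blocked loop order:

```python
def spmv(row_ptr, col, val, x):
    """Sparse matrix-vector product y = A*x in compressed-row storage.

    The references x[col[k]] are driven by the run-time index array
    `col`, so no static (compile-time) loop restructuring can confine
    them to the current contents of the cache -- the situation that
    run-time techniques such as RTRC target.
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[k] * x[col[k]]   # indirect, data-dependent reference
    return y

# 2x2 example: A = [[1, 2], [0, 3]], x = [1, 1]
print(spmv([0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], [1.0, 1.0]))  # [3.0, 3.0]
```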
Heuristic techniques, such as spectral bisection and simulated annealing, are used to create an irregular distribution of data in an attempt to minimize inter-processor communication. However, these methods introduce the problems of determining the run-time-dependent remote access requirements of the application and of providing efficient facilities to perform the communication. These problems are addressed by the CHAOS/PARTI run-time and compilation methods [4, 12, 10]. The essential technique is the inspector/executor model, in which the inspector determines which references are required for execution, and the executor performs the communication and the actual computation.

Cache optimization for irregular problems shares the same difficulty as multi-processor irregular data distribution: the reference pattern is not determined until run time and may vary during execution. During compilation of such programs it is impossible to determine a loop structure that will confine references to the current contents of the cache. It is exactly this type of problem that this paper addresses.

Several methods have been explored for improving the cache performance of irregular computations. Multithreading [6] attempts to hide memory latency by creating many parallel threads of computation that can be finely scheduled with respect to the availability of data to process. Another approach is to modify the reference order to improve locality. For certain applications, such as finite-element methods, cache performance can be improved by applying algorithms that narrow the bandwidth of the constructed sparse matrix, thereby increasing reference locality [3]. However, this does not address the details of reuse, nor is it generally applicable to irregular problems. To our knowledge, [11] is the only paper that applies data repartitioning explicitly for cache optimization.
They determine the size of a sub-domain of the local cache, based on an analysis of the data structures and algorithms of a problem, and then use a domain-decomposition scheme at run time to reorganize data to fit into these local cache regions. Their paper also examines software prefetching. When applied to the KSR-1 (a Cache-Only Memory Architecture), both of these techniques reduce the number of

Proc. Second Aizu Int. Symposium on Parallel Algorithms/Architectures Synthesis, Aizu-Wakamatsu, Japan, March 17-21, 1997, IEEE Computer Society Press, Los Alamitos, CA, pp. 42-49
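As a minimal illustration of the run-time reordering idea discussed above (an analogue sketched here under assumed names, not the authors' actual implementation), iterations of an indirect loop can be grouped so that consecutive iterations reference elements from the same cache-sized block:

```python
def cluster_iterations(idx, block):
    """Group loop iterations by the cache block their indirect
    reference falls into.

    Instead of executing iterations i = 0, 1, 2, ... in program order,
    they are bucketed by idx[i] // block, so each bucket's references
    to x[idx[i]] fall within one contiguous, cache-sized region of x.
    This mirrors, at cache granularity, how re-mapping global data
    reduces off-processor references on a distributed-memory machine.
    """
    buckets = {}
    for i, target in enumerate(idx):
        buckets.setdefault(target // block, []).append(i)
    order = []
    for b in sorted(buckets):          # visit blocks in address order
        order.extend(buckets[b])
    return order

idx = [7, 0, 5, 1, 6, 2, 4, 3]         # run-time indirection array
order = cluster_iterations(idx, block=4)
print([idx[i] for i in order])         # [0, 1, 2, 3, 7, 5, 6, 4]
```

With `block=4`, all references into `x[0..3]` are executed before any reference into `x[4..7]`, so each half of `x` is loaded into the cache once rather than repeatedly evicted.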