Communication-Optimal Parallel N-body Solvers

Aparna Chandramowlishwaran, Richard Vuduc
School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA

Abstract

We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. Our research specifically addresses two key challenges. The first challenge is how to engineer fast code for today’s platforms. We present the first in-depth study of multicore optimizations and tuning for FMM, along with a systematic approach for transforming a conventionally parallelized FMM into a highly tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high performance in the face of multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. This analysis yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs (if there are no significant changes) could cause it to become communication-bound as early as the year 2020. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design.

1 Introduction

The fast multipole method (FMM) applies broadly to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena. It is regarded as one of the most important algorithms in scientific and engineering computing [4].
A deep understanding of how to improve its scalability has wide-ranging implications, not just for physical science and engineering applications, but also for those in the emerging area of massive-scale statistical data analysis. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. From a performance analysis and engineering perspective, it consists of different components exhibiting different compute and memory characteristics, and therefore serves as an instructive case study for improving performance and scalability in broader contexts.

Engineering Fast Algorithms: Given the importance of the FMM, we require an optimized and scalable implementation to achieve the performance targets set for exascale systems. We present the first in-depth study of the performance optimization and scaling of FMM on multicore platforms. We document the steps in improving the scalability of FMM as a systematic process that could be automated by a performance analysis tool. The study focuses on single-node performance since it is a critical building block in scalable multi-node distributed-memory codes.

Communication Costs for FMM: The total execution time consists of computation time (flops) and communication time (to move data to and from memory and/or between processors). Processor technology trends [8] suggest that under the current trajectory of processor design, computation time will decrease at a much faster rate than communication time. To design efficient scalable algorithms, it is not sufficient to minimize computation time alone; we must also minimize communication time. This necessitates understanding the communication costs of every algorithm. We present a new analysis of memory hierarchy communication for the FMM.
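The split of execution time into computation and communication described above can be illustrated with a minimal sketch. It is not the paper's performance model. It assumes that the two costs do not overlap, that each is a simple ratio of work to machine rate, and that machine rates grow by a fixed factor per year. All function names, parameters, and growth factors are hypothetical and chosen only for illustration.

```python
def execution_time(flops, bytes_moved, peak_flops, bandwidth):
    """Return (compute_time, comm_time, total_time) in seconds.

    Assumption: computation and communication do not overlap, so the
    total is simply the sum of the two terms.
    """
    t_comp = flops / peak_flops       # time spent on arithmetic
    t_comm = bytes_moved / bandwidth  # time spent moving data
    return t_comp, t_comm, t_comp + t_comm


def first_comm_bound_year(flops, bytes_moved, peak_flops, bandwidth,
                          flops_growth, bw_growth, start_year, horizon=50):
    """Return the first year in which communication time exceeds
    computation time, or None if that never happens within `horizon` years.

    Assumption: peak_flops and bandwidth each improve by a constant
    (hypothetical) factor per year.
    """
    for k in range(horizon + 1):
        t_comp, t_comm, _ = execution_time(
            flops, bytes_moved,
            peak_flops * flops_growth ** k,
            bandwidth * bw_growth ** k,
        )
        if t_comm > t_comp:
            return start_year + k
    return None
```

Under these assumptions, a kernel that is compute-bound today becomes communication-bound once the gap between the two growth rates has compounded past its current compute-to-communication time ratio. If both rates grow equally, the balance never changes.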
Our analysis refines the estimates of the constants, normally ignored in traditional asymptotic analyses of the FMM, with calibration against our state-of-the-art implementation. Our analytical performance model is the first for the FMM to capture not only algorithmic tuning knobs but also architectural parameters, thereby enabling high-level prediction and algorithm-architecture co-design [3].

2 Fast Multipole Method

This section provides a brief overview of the fast multipole method (FMM). For more in-depth algorithmic details, see Greengard et al. [5, 9]. Given a system of N source particles, with positions given by $\{y_1, \ldots, y_N\}$, and N targets with positions $\{x_1, \ldots, x_N\}$, we wish to compute the N sums,