Auto-tuning of the FFTW Library for Massively Parallel Supercomputers

Massimiliano Guarrasi (a), Giovanni Erbacci (a) and Andrew Emerson (a)
(a) CINECA, Italy

Abstract

In this paper we present the work carried out by CINECA in the framework of the PRACE-2IP project, which aimed to improve the performance of the FFTW library by refining the auto-tuning mechanism already implemented in that library. The optimization comprised the following activities: identification of the major bottlenecks in the current FFTW implementation; investigation of the auto-tuning mechanism provided by FFTW, in order to understand how performance is affected by the domain decomposition; introduction of a new parallel domain decomposition; and construction of a library that improves the performance of the auto-tuning mechanism. In particular, we compared the performance of the standard slab decomposition algorithm already present in FFTW with that obtained using a 2D domain decomposition, and we found that on massively parallel supercomputers the performance of the new algorithm is significantly higher.

1. Introduction

Many of today's challenging scientific problems require Discrete Fourier Transform algorithms (DFT, e.g. (1)), and one of the most popular libraries used by the scientific community is FFTW ((2), (3)). This library, which is free software, is a C subroutine library for computing DFTs in one or more dimensions, of arbitrary input size, on both real and complex data. FFTW can also compute discrete Hartley transforms (DHTs) of real data, again of arbitrary length. FFTW employs O(n log n) algorithms for all lengths, supports arbitrary multi-dimensional data, and includes parallel (multi-threaded) transforms for shared-memory systems as well as distributed-memory parallel transforms based on MPI.
FFTW does not use a fixed algorithm to compute the transform; instead, it adapts the DFT algorithm to the underlying hardware in order to maximize performance. The computation of a transform is therefore split into two phases. First, FFTW's planner "learns" the fastest way to compute the transform on the selected machine, producing a data structure, called a plan, that contains this information. The plan is then executed to transform the array of input data, and it can be reused as many times as needed. In typical high-performance applications many transforms of the same size are computed, so a relatively expensive initialization of this sort is acceptable. If, on the other hand, only a single transform of a given size is needed, the one-time cost of the planner becomes significant; for this case, FFTW provides fast planners based on heuristics or on previously computed plans. During plan creation, users can choose the method they prefer via the FFTW_MEASURE flag (which yields a more accurate plan) or the FFTW_ESTIMATE flag (which yields a plan more quickly).

Currently, particularly for small data arrays, the FFTW library has been shown not to scale well beyond a few hundred cores. Considering that current PRACE Tier-0 systems consist of several hundreds of thousands of cores, and that good scalability up to at least a few thousand cores is required to obtain access to these systems, there is a clear need to improve the FFTW implementation on massively parallel supercomputers. For this purpose, a large amount of time must first be dedicated to extensive benchmarking in order to find the major

Available online at www.prace-ri.eu (Partnership for Advanced Computing in Europe)