Parallelizing Irregular Applications through the YAPPA Compilation Framework

Silvia Lovergine*†, Antonino Tumeo*, Oreste Villa‡, Fabrizio Ferrandi†
† Politecnico di Milano, DEIB - Milano, Italy
* Pacific Northwest National Laboratory - Richland, WA, USA
‡ NVIDIA - Santa Clara, USA

1. INTRODUCTION

Modern High Performance Computing (HPC) clusters are composed of hundreds of nodes integrating multicore processors with advanced cache hierarchies. These systems can reach several petaflops of peak performance, but they are optimized for floating-point-intensive applications and for regular, localizable data structures. The network interconnection of these systems is optimized for bulk, synchronous transfers. On the other hand, many emerging classes of scientific applications (e.g., computer vision, machine learning, data mining) are irregular [1]. They exploit dynamic, linked data structures (e.g., graphs, unbalanced trees, unstructured grids). Such applications are inherently parallel, since the computation needed for each element of the data structures is potentially concurrent. However, these data structures are subject to unpredictable, fine-grained accesses: they exhibit almost no locality and present high synchronization intensity. Distributed memory systems are naturally programmed with the Message Passing Interface (MPI). Moreover, Single Program, Multiple Data (SPMD) control models are usually employed: at the beginning of the application, each node is associated with a process that operates on its own chunk of data, and communication usually happens only in precise application phases. Developing irregular applications with these models on distributed systems poses complex challenges and requires significant programming effort. Irregular applications employ datasets that are very difficult to partition in a balanced way; thus, shared memory abstractions, such as the Partitioned Global Address Space (PGAS), are preferred.
In this work we introduce YAPPA (Yet Another Parallel Programming Approach), a compilation framework, based on the LLVM compiler, for the automatic parallelization of irregular applications on modern HPC systems. We briefly introduce an efficient parallel programming approach for these applications on distributed memory systems. We propose a set of compiler transformations for automatic parallelization, which can reduce development and optimization effort, and a set of transformations for improving the performance of the resulting parallel code, focusing on irregular applications. We implemented these transformations in LLVM and evaluated a first prototype of the framework on a common irregular kernel (graph Breadth First Search).

2. PROPOSED APPROACH

This section briefly introduces GMT (Global Memory and Threading library) [2], the run-time library we use for improving the performance of irregular applications. We then describe the YAPPA compilation framework, mapped on top of GMT, for parallel code generation.

2.1 GMT Library

GMT is a run-time library for irregular applications on distributed memory HPC systems. It provides an API to produce a parallel version of a sequential application. GMT is built around three main concepts: global address space, latency tolerance through fine-grained software multithreading, and aggregation. First, GMT implements a PGAS data model, which enables a global address space across the distributed memory of the system without losing the concept of data locality. This allows developing the application without partitioning the dataset. Parallelism is expressed in the form of parallel for constructs. GMT implements a fork/join control model. Compared to the SPMD control models typical of message passing or PGAS programming models, this model better copes with the large amounts of fine-grained, dynamic parallelism of irregular applications.
Second, GMT implements lightweight software multithreading, which allows tolerating the latencies of accesses to data at remote locations. It provides primitives for atomic operations, which allow managing synchronization among the nodes. When a core executes a task that issues an operation to a remote memory location, it switches to another task while the memory operation completes, hiding the access latency with other computation. Finally, GMT aggregates the commands directed towards each node to reduce the overheads of fine-grained network transactions.

2.2 The YAPPA Parallelizing Compiler

YAPPA extends the LLVM compiler with a set of transformations and optimizations for irregular applications. It targets the GMT run-time, executing on commodity clusters. It takes as input a C/C++ application, manually instrumented by the programmer with synchronization primitives (e.g., atomic addition, compare-and-swap, etc.), and produces a parallel C/C++ version by instrumenting LLVM's intermediate representation with GMT primitives. The transformations consist of two steps: data management and loop parallelization. In the first step, YAPPA identifies shared data and transforms accesses to such data into global memory accesses. It also tries to move as many independent memory operations as possible to the beginning of the loop, substituting blocking memory operation primitives with their equivalent non-blocking versions (unblocking of memory accesses). Although GMT has an aggregation mechanism, YAPPA tries to extract as much parallelism as possible (with the unblocking transformation) while at the same time reducing the overhead due to numerous transfers.