A Framework for Automatic Hybrid MPI+OpenMP Code Generation

Khaled Hamidouche, Joel Falcou, Daniel Etiemble
Laboratoire de Recherche en Informatique
University Paris-Sud XI, Orsay, France
hamidou,joel.falcou,de@lri.fr

Keywords: Generic Programming, Bulk Synchronous Parallelism, Performance Prediction, OpenMP, MPI, MPI+OpenMP

Abstract

Clusters of symmetric multiprocessors (SMPs) are currently the most widely used architecture for large-scale applications, and combining the MPI and OpenMP models is regarded as a suitable programming model for such architectures. But writing efficient MPI+OpenMP programs requires expertise and performance analysis to determine the best number of processes and threads for the optimal execution of a given application on a given platform. To solve these problems, we propose a framework for the development of hybrid MPI+OpenMP programs. This paper provides the following contributions: (i) a compiler analyser that estimates the computing time of a sequential function; (ii) a code generator that produces hybrid code based on the compiler analyser and on a simple analytical parallel performance prediction model estimating the execution time of a hybrid program; (iii) an evaluation of the accuracy of the framework and of its usability on several benchmarks.

1. INTRODUCTION

Clusters of symmetric multiprocessors (SMPs) are the most cost-effective solution for large-scale applications: in the Top500 list of supercomputers, most machines (if not all) are clusters of SMPs. In this context, the hybrid MPI+OpenMP approach is regarded as a suitable programming model for such architectures. Some works [8, 13] have reported performance improvements when using MPI+OpenMP. On the other hand, there are also significant reports of poor hybrid performance [7, 10] or of minor benefits when adding OpenMP to an MPI program [5, 18].
The factors that impact the performance of this approach are numerous, complex and interrelated:

• MPI communication efficiency: the performance of the MPI communication constructs (point-to-point, broadcast, all-to-all, ...), the message sizes, and the interconnection latencies and bandwidths are key factors.

• OpenMP parallelization efficiency: the use of critical section primitives, the overhead of OpenMP thread management, and false sharing can reduce performance.

• MPI and OpenMP interaction: load-balancing issues and threads left idle during the MPI communication phases reduce parallel efficiency.

To obtain an efficient hybrid program, one must determine the number of MPI processes and OpenMP threads to be launched on a given platform for a given data size. With N nodes, each of them having p cores, it is customary to launch N processes with p threads per process. This choice raises two potential sources of performance loss. First, as the volume of interprocessor communication increases with the number of processes, the resulting overhead can become a disadvantage. Second, contention between threads (on data and/or explicit synchronization) also limits scalability, leading to performance stagnation or slowdown. In addition to this performance aspect, deriving and deploying hybrid programs requires a significant technical effort. To solve these problems, developers need effective automatic parallelization tools and libraries.

The main contribution of this work is the design and development of a new framework that produces efficient hybrid code for multi-core architectures from the source code of a sequential function.
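The trade-off between process count and thread count can be illustrated with a toy analytical model (the cost function, constants and names below are purely illustrative, not the model described in Section 3): predicted time is a perfectly parallel compute term divided by processes × threads, plus a communication term growing with the number of processes and a thread-management term growing with the number of threads. Enumerating the feasible configurations and keeping the minimum sketches the kind of search the framework performs:

```python
# Toy analytical model (illustrative only): choose the (processes, threads)
# pair minimizing the predicted hybrid execution time on N nodes of p cores.

def predict_time(work, procs, threads, t_flop=1e-9, alpha=5e-5, beta=2e-6):
    """Hypothetical cost model: parallel compute term, communication
    overhead growing with the process count, and thread-management
    overhead growing with the thread count."""
    compute = work * t_flop / (procs * threads)
    comm = alpha * procs      # more processes -> more MPI communication
    omp = beta * threads      # more threads -> more fork/join overhead
    return compute + comm + omp

def best_config(work, nodes, cores_per_node):
    """Enumerate feasible (processes, threads) pairs and keep the fastest."""
    best = None
    for procs in range(1, nodes * cores_per_node + 1):
        for threads in range(1, cores_per_node + 1):
            if procs * threads > nodes * cores_per_node:
                continue  # cannot oversubscribe the cores in this sketch
            t = predict_time(work, procs, threads)
            if best is None or t < best[0]:
                best = (t, procs, threads)
    return best

if __name__ == "__main__":
    t, procs, threads = best_config(work=10**8, nodes=4, cores_per_node=8)
    print(f"best: {procs} processes x {threads} threads, predicted {t:.4f} s")
```

With these (arbitrary) constants, the compute term favours using all cores while the overhead terms penalize extreme configurations, so the optimum is generally neither "one process per core" nor "one process per node with maximal threading".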
Some consequences of this contribution are:

• Developers can generate efficient hybrid code without learning new material, simply by providing a list of sequential functions.

• Existing legacy codes, such as codes using the SPMD model, can be ported without significant loss of efficiency. This is the case for many sequential or MPI codes.

This paper is organized as follows: Section 2 presents the related work. Section 3 describes the framework and its architecture. The experimental results on various HPC algorithms are given in Section 4. Finally, Section 5 sums up our contributions and opens on future works.

2. RELATED WORK

2.1. Existing tools

The MPI+OpenMP programming model has been extensively discussed in the literature, including source analysers, performance-analysis tools and parallel computing models.