A Runtime Approach for Dynamic Load Balancing of OpenMP Parallel Loops in LLVM

Jonas H. Müller Korndörfer, Florina M. Ciorba, Akan Yilmaz
Department of Mathematics & Computer Science, University of Basel, Switzerland
firstname.lastname@unibas.ch

Christian Iwainsky
Technische Universität Darmstadt, Germany
christian.iwainsky@sc.tu-darmstadt.de

Johannes Doerfert, Hal Finkel
Argonne National Laboratory, Lemont, IL, USA
[jdoerfert,hfinkel]@anl.gov

Vivek Kale
Brookhaven National Laboratory, Upton, NY, USA
vkale@bnl.gov

Michael Klemm
Intel Deutschland GmbH, Feldkirchen, Germany
michael.klemm@intel.com

ABSTRACT

Load imbalance is the major source of performance degradation in computationally-intensive applications that frequently consist of parallel loops. Efficient scheduling of parallel loops can improve the performance of such programs. OpenMP is the de-facto standard for parallel programming on shared-memory systems. The current OpenMP specification provides only three choices for loop scheduling, which are insufficient in scenarios with irregular loops, system-induced interference, or both. Therefore, this work augments the LLVM implementation of the OpenMP runtime library with eleven state-of-the-art plus three new and ready-to-use scheduling techniques. We tested the existing and the added loop scheduling strategies on several applications from the NAS, SPEC OMP 2012, and CORAL-2 benchmark suites. The experimental results show that each of the newly implemented scheduling techniques outperforms the others in certain application and system configurations. We measured performance gains of up to 6% compared to the fastest previously available scheduling techniques. This work establishes the importance of beyond-standard scheduling options in OpenMP for the benefit of evolving applications executing on evolving multicore architectures.

KEYWORDS

Scheduling; dynamic load balancing; OpenMP; LLVM.
1 INTRODUCTION

Parallel and distributed applications in science, engineering, and industry are complex, large, and generally exhibit irregular and non-deterministic behavior. Moreover, their performance frequently relies on computationally-intensive large parallel loops. High performance computing (HPC) platforms are increasingly complex, large, heterogeneous, and exhibit massive and diverse parallelism. The execution of such applications on existing HPC platforms can suffer from numerous performance-degrading phenomena.

Load imbalance is the major source of performance degradation in computationally-intensive applications [14]. On shared-memory systems, load imbalance can result from the uneven assignment of work to threads, the unequal allocation of threads to processors, or system heterogeneity. The first of these can be mitigated via scheduling techniques that distribute the work in different manners. It is well known that no single loop scheduling technique can address all sources of load imbalance to effectively optimize the performance of all parallel applications executing on various systems. This poses the challenge of identifying the most suitable scheduling strategy for a given application-system tuple. OpenMP is the de-facto parallel programming approach for loops on shared-memory systems, offering three scheduling options for worksharing loops: static, guided, and dynamic. These options are insufficient for certain application-system tuples for which other scheduling strategies can improve performance. Therefore, more scheduling techniques are needed in OpenMP. In this work, we extend the LLVM (llvm.org) OpenMP runtime library (RTL), libomp, by eleven state-of-the-art scheduling techniques plus three improved implementations. We chose the LLVM implementation as it is open-source and widely used in many production and scientific parallel codes. Furthermore, libomp is highly compatible with other implementations, such as Intel, GCC, and PGI.
The state-of-the-art scheduling techniques added are: Fixed Size Chunking (fsc), Factoring (fac), Factoring2 (fac2), Taper (tap), Weighted Factoring (wf), Bold (bold), Adaptive Weighted Factoring with its four variants (awf-b, awf-c, awf-d, and awf-e), and Adaptive Factoring (af). We also made certain implementation-related improvements to the fac, fac2, and af techniques, hereafter denoted by the suffix "a". We conducted experiments and present in the accompanying poster the results of executing benchmarks from the NAS, SPEC OMP 2012, and CORAL-2 suites with the three standard (static, guided, and dynamic), one non-standard (trapezoidal), and the eleven (plus the three improved implementations) added scheduling techniques.

2 LOOP SCHEDULING IN LLVM OPENMP RTL

Figure 1 illustrates the scheduling process in the LLVM OpenMP RTL. The libomp library uses three main functions to perform the scheduling of iterations from a loop onto threads: init(), next(), and finish(). The scheduling techniques