A Microbenchmark Study of OpenMP Overheads Under Nested Parallelism

Vassilios V. Dimakopoulos, Panagiotis E. Hadjidoukas, Giorgos Ch. Philos
Department of Computer Science, University of Ioannina
Ioannina, Greece, GR-45110
{dimako,phadjido,gfilos}@cs.uoi.gr

Abstract

In this work we present a microbenchmark methodology for assessing the overheads associated with nested parallelism in OpenMP. Our techniques are based on extensions to the well-known EPCC microbenchmark suite that allow measuring the overheads of OpenMP constructs when they are effected in inner levels of parallelism. The methodology is simple yet powerful, and has enabled us to gain interesting insight into problems related to implementing and supporting nested parallelism. We measure and compare a number of commercial and freeware compilation systems. Our general conclusion is that while nested parallelism is fortunately supported by many current implementations, the performance of this support is rather problematic. There seem to exist issues that have not yet been addressed effectively, as most OpenMP systems do not exhibit a graceful reaction when made to execute inner levels of concurrency.

1 Introduction

OpenMP [1] has become a standard paradigm for shared-memory programming, as it offers the advantage of simple and incremental parallel program development at a high level of abstraction. Nested parallelism has been a major feature of OpenMP since its very beginning. As a programming style, it provides an elegant solution for a wide class of parallel applications, with the potential to achieve substantial processor utilization in situations where outer-loop parallelism simply cannot. Despite its significance, nested parallelism support was slow to find its way into OpenMP implementations, commercial and research ones alike. Even nowadays, the level of support varies greatly among compilers and runtime systems.
For applications that have enough (and balanced) outer-loop parallelism, a small number of coarse threads is usually enough to produce satisfactory speedups. In many other cases though, including situations with multiple nested loops, or recursive and irregular parallel applications, threads should be able to create new teams of threads