A Context-Aware Primitive for Nested Recursive Parallelism

Herbert Jordan 1(B), Peter Thoman 1, Peter Zangerl 1, Thomas Heller 2, and Thomas Fahringer 1

1 University of Innsbruck, Innsbruck, Austria
{herbert,petert,peterz,tf}@dps.uibk.ac.at
2 Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
thomas.heller@fau.de

Abstract. Nested recursive parallel applications constitute an important super-class of conventional, flat parallel codes. For this class, parallel libraries utilizing the concept of tasks have been widely adopted. However, the provided abstract task creation and synchronization interfaces force corresponding implementations to focus their attention on individual task creation and synchronization points – unaware of their relation to each other – thereby losing optimization potential.

Within this paper, we present a novel interface for task-level parallelism, enabling implementations to grasp and manipulate the context of task creation and synchronization points – in particular for nested recursive parallelism. Furthermore, as a concrete application, we demonstrate the interface's capability to reduce parallel overhead within applications, based on a reference implementation utilizing C++14 template meta-programming techniques to synthesize multiple versions of a parallel task during the compilation process.

To demonstrate its effectiveness, we evaluate the impact of our approach on the performance of a series of eight task-parallel benchmarks. For those, our approach achieves substantial speed-ups over state-of-the-art solutions, in particular for use cases exhibiting fine-grained tasks.

1 Introduction

For the development of parallel programs, various programming language extensions and libraries have been created. Many of these, including OpenMP, MPI, OpenCL, or CUDA, focus on the concept of parallel loops, and variations thereof, as their primary use case.
In general, the associated data parallelism provides high degrees of concurrency, leading to scalable applications. Furthermore, the management overhead for distributing sub-ranges of parallel loops scales only with the number of processors, not the problem size itself – and is thus low.

© Springer International Publishing AG 2017
F. Desprez et al. (Eds.): Euro-Par 2016 Workshops, LNCS 10104, pp. 149–161, 2017.
DOI: 10.1007/978-3-319-58943-5 12