Comparison of Single-Source Shortest Path Algorithms on Two Recent Asynchronous Many-Task Runtime Systems

Jesun Sahariar Firoz, Martina Barnas, Marcin Zalewski and Andrew Lumsdaine
jsfiroz,mbarnas,zalewski,lums@indiana.edu
Center for Research in Extreme Scale Technologies (CREST)
Indiana University, Bloomington, IN, USA

Abstract—With the advent of the exascale era, new runtimes and algorithm design techniques need to be explored. In this paper, we investigate the performance of three different single-source shortest path algorithms in two relatively recent asynchronous many-task runtime systems, AM++ and HPX-5. We identify the underlying set of differentiating features for these runtimes, and we compare and contrast the performance of the Δ-stepping algorithm, the Distributed Control-based algorithm, and the K-level Asynchronous algorithm in AM++ and in HPX-5, for which we also include a chaotic implementation. We observe that specific runtime characteristics, or the lack thereof, and different graph inputs can impact the feasibility of an algorithmic approach.

I. INTRODUCTION

Numerous efforts are ongoing all over the world to design and prototype runtime systems for exascale machines [1, 2]. These runtime systems differ in their sets of characteristics, and, based on those distinguishing characteristics, algorithm execution time may vary significantly from one runtime to another. For portability of algorithms, we need to understand what runtime features are needed to support efficient execution of an algorithm. In addition, it is essential to note that, to utilize the full potential of exascale systems, it is desirable to keep all computational elements of the entire machine as busy as possible while decreasing the execution time of an algorithm. To this end, algorithms need to be designed for asynchrony and distributed over computational nodes. In this study, we have chosen the prototypical irregular problem of graph traversal, specifically single-source shortest path (SSSP).
Most distributed implementations of SSSP are an incarnation of the coarse-grained bulk synchronous parallel (BSP) [3] "compute-communicate" model, and they suffer from the associated synchronization overheads. The manifestation of synchronization overhead is likely to amplify profoundly on exascale machines. Currently, BSP-based systems such as Pregel [4], GraphLab [5], and PowerGraph [6] use the gather-apply-scatter model to process a problem in steps. The Parallel Boost Graph Library (PBGL) started with such a model too [7]. Applying BSP strictly limits the amount of possible parallelism to a parallel Dijkstra algorithm that uses a concurrent priority queue, a slightly more optimistic [8] algorithm that relaxes the priority constraints, or the Bellman-Ford algorithm [9], which allows processing of the "front" of the SSSP search. The Δ-stepping [10] approach, used for example by Edmonds et al. [7] and by Chakaravarthy et al. (hybridized with Bellman-Ford) [11], relaxes the BSP approach into supersteps with unordered optimistic parallelism within each step, limiting the step size to prevent work explosion. Global BSP control introduces the straggler effect, where the whole distributed system must wait for the last straggler before moving to the next step; the larger the system gets, the more pronounced this effect becomes. To mitigate this, the Distributed Control (DC) [12] approach forgoes global synchronization and instead orders computation approximately and locally. Very recently, the k-level asynchronous (KLA) paradigm [13] has been proposed, in which the level of asynchrony can be controlled parametrically.

Graph traversal is a basic building block of other graph algorithms used in social network analytics, transportation optimization, artificial intelligence, power grids, and, in general, any problem where data consists of entities that connect and interact in irregular ways.
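To make the Δ-stepping idea concrete, the following is a minimal sequential sketch (not the distributed AM++ or HPX-5 implementations evaluated in this paper): vertices are grouped into buckets of width Δ by tentative distance, "light" edges (weight ≤ Δ) are re-relaxed within a bucket, and "heavy" edges are relaxed once the bucket is settled. The graph representation and function name here are illustrative choices, not from the paper.

```python
import math
from collections import defaultdict

def delta_stepping(adj, source, delta):
    """Sequential sketch of Δ-stepping SSSP (Meyer & Sanders).

    adj: dict mapping each vertex to a list of (neighbor, weight) pairs.
    Returns a dict of tentative shortest distances from `source`.
    """
    dist = defaultdict(lambda: math.inf)
    buckets = defaultdict(set)  # bucket index -> set of vertices

    def relax(v, d):
        # Move v to the bucket matching its improved tentative distance.
        if d < dist[v]:
            if dist[v] != math.inf:
                buckets[int(dist[v] // delta)].discard(v)
            dist[v] = d
            buckets[int(d // delta)].add(v)

    relax(source, 0)
    while buckets:
        i = min(buckets)          # smallest non-empty bucket
        settled = set()
        # Drain bucket i, re-relaxing light edges until it stabilizes.
        while buckets.get(i):
            frontier = buckets.pop(i)
            settled |= frontier
            for u in frontier:
                for v, w in adj[u]:
                    if w <= delta:
                        relax(v, dist[u] + w)
        # Heavy edges cannot re-populate bucket i, so relax them once.
        for u in settled:
            for v, w in adj[u]:
                if w > delta:
                    relax(v, dist[u] + w)
        buckets.pop(i, None)
    return dict(dist)
```

With Δ set to the maximum edge weight this degenerates toward Bellman-Ford; with Δ smaller than the minimum edge weight it behaves like Dijkstra's algorithm, which is the ordering/parallelism trade-off the text describes.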
In addition to its relevance to a wide range of applications, and more importantly for our interests, graph traversal is a paragon of an irregular application that is sensitive to the whole hardware/software stack. This is recognized by the current Graph500 benchmark [14], which is based on breadth-first search (BFS), with a proposal to extend the benchmark with SSSP, and which provides a ranking of HPC machines. We acknowledge that graph inputs of the sizes considered here (and elsewhere in the literature) do not necessitate the use of supercomputing; rather, we utilize their characteristics to advance our knowledge of highly complex issues pertinent to exascale computing.

To address the challenges of exascale, we are developing a runtime system called High Performance ParalleX 5 (HPX-5) [15], based on the ParalleX computation model [16–18]. HPX-5 provides some key features of the ParalleX model, e.g., an active global address space (AGAS), local control objects (LCOs) such as futures, and execution based on interdependent light-weight tasks. Following the design philosophy of the Standard Template Library (STL) [19] and the trail of evolution of the Parallel Boost Graph Library (PBGL) [7], we have begun our endeavor to develop a next-generation distributed graph library based on HPX-5. Our first experiments are in implementing graph algorithms for solving the SSSP problem, a good representative of a class of irregular graph problems. We adopt a co-design approach where we use lessons learned from implementing distributed graph algorithms to guide the development of HPX-5, and vice versa. Previously, our PBGL-2 graph library was based on the AM++ implementation [20] of the Active Pebbles model [21], in which algorithms are expressed by unbounded-depth active