STAPL: A Standard Template Adaptive Parallel C++ Library

Ping An, Alin Jula, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy M. Amato, Lawrence Rauchwerger

Abstract—The Standard Template Adaptive Parallel Library (STAPL) is a parallel library designed as a superset of the ANSI C++ Standard Template Library (STL). It is sequentially consistent for functions with the same name, and executes on uni- or multi-processor systems that utilize shared or distributed memory. STAPL is implemented using simple parallel extensions of C++ that currently provide an SPMD model of parallelism, and it supports nested parallelism. The library is intended to be general purpose, but emphasizes irregular programs to allow the exploitation of parallelism in areas such as particle transport calculations, molecular dynamics, geometric modeling, and graph algorithms, which use dynamically linked data structures. STAPL provides several different algorithms for some library routines, and selects among them adaptively at run-time. STAPL can replace STL automatically by invoking a preprocessing translation phase. The performance of translated code is close to the results obtained using STAPL directly (less than 5% performance deterioration). However, STAPL also provides functionality to allow the user to further optimize the code and achieve additional performance gains. We present results obtained using STAPL for a molecular dynamics code and a particle transport code.

I. MOTIVATION

In sequential computing, standardized libraries have proven to be valuable tools for simplifying the program development process by providing routines for common operations that allow programmers to concentrate on higher level problems. Similarly, libraries of elementary, generic, parallel algorithms provide important building blocks for parallel applications and specialized libraries [9], [8], [24].
Due to the added complexity of parallel programming, the potential impact of libraries could be even more profound than for sequential computing. Indeed, we believe parallel libraries are crucial for moving parallel computing into the mainstream since they offer the only viable means for achieving scalable performance across a variety of applications and architectures with programming efforts comparable to those of developing sequential codes. In particular, properly designed parallel libraries could insulate less experienced users from managing parallelism by providing routines that are easily interchangeable with their sequential counterparts, while allowing more sophisticated users sufficient control to achieve higher performance gains.

Designing parallel libraries that are both portable and efficient is a challenge that has not yet been met. This is due mainly to the difficulty of managing concurrency and the wide variety of parallel and distributed architectures. For example, due to the differing costs of an algorithm's communication patterns on different memory systems, the best algorithm on one machine is not necessarily the best on another. Even on a given machine, the algorithm of choice may vary according to the data and run-time conditions (e.g., network traffic and system load).

Another important constraint on the development of any software package is its inter-operability with existing codes and standards.

[Author footnote: Dept. of Computer Science, Texas A&M University, College Station, TX 77843-3112. Email: {pinga, alinj, silviusr, sms5644, tgs7381, gabrielt, nthomas, amato, rwerger}@cs.tamu.edu. Research was supported in part by NSF CAREER awards CCR-9624315 and CCR-9734471, by NSF grants IRI-9619850, ACI-9872126, EIA-9805823, EIA-9810937, EIA-9975018, by DOE ASCI ASAP Level 2 grant B347886, by Sandia National Laboratory, and by a Hewlett-Packard Equipment Grant. Thomas was supported in part by a Dept. of Education Graduate Fellowship.]
The public dissemination and eventual adoption of any new library depends on how well programmers can interface old programs with new software packages. Extending or building on top of existing work can greatly reduce both development effort and the users' learning curve.

To liberate programmers from the concerns and difficulties mentioned above, we have designed STAPL (Standard Template Adaptive Parallel Library). STAPL is a parallel C++ library with functionality similar to STL, the ANSI adopted C++ Standard Template Library [20], [26], [16]. To ease the transition to parallel programming and to ensure portability across machines and programs, STAPL is a superset of STL that is sequentially consistent for functions with the same name. STAPL executes on uni- or multi-processor architectures with shared or distributed memory and can co-exist in the same program with STL functions. STAPL is implemented using simple parallel extensions of C++ which provide an SPMD model of parallelism and support nested (recursive) parallelism (as in NESL [9]). In a departure from previous parallel libraries, which have almost exclusively targeted scientific or numerical applications, STAPL emphasizes irregular programs. In particular, it can exploit parallelism when dynamically linked data structures replace vectors as the fundamental data structure in application areas such as geometric modeling, particle transport, and molecular dynamics.
STAPL is designed in layers of abstraction: (i) the interface layer used by application programmers, which is STL compatible; (ii) the concurrency and communication layer, which expresses parallelism and communication/synchronization in a generic manner; (iii) the software implementation layer, which instantiates the concurrency and communication abstractions to high level constructs (e.g., m_fork and parallel_do for concurrency, and OpenMP synchronizations and MPI primitives for communication); and (iv) the machine layer, which is OS, RTS and architecture dependent. The machine layer also maintains a performance database for STAPL on each machine and environment.

STAPL provides portability across multiple platforms by including its own (adapted from [21]) run-time system which supports high level parallel constructs (e.g., forall). Currently STAPL can interface directly to Pthreads and maintains its own scheduling. It can issue MPI and OpenMP directives and can use the native run-time system on several machines (e.g., HP