Extending OpenMP to Support Slipstream Execution Mode

Khaled Z. Ibrahim and Gregory T. Byrd
Dept. of Electrical and Computer Engineering, North Carolina State University
{kzmousta, gbyrd}@ece.ncsu.edu

Abstract

OpenMP has emerged as a widely accepted standard for writing shared-memory programs. Hardware-specific extensions, such as data placement, are usually needed to improve the scalability of applications based on this standard. This paper investigates the implementation of an OpenMP compiler that supports slipstream execution mode, a new optimization mechanism for CMP-based distributed shared memory multiprocessors. Slipstream mode uses additional processors to reduce communication overhead, rather than to increase parallelism.

We discuss how each OpenMP construct can be implemented to take advantage of slipstream mode, and we present a minor extension that allows runtime or compile-time control of slipstream execution. We also investigate the interaction between slipstream mechanisms and OpenMP scheduling. Our implementation supports both static and dynamic scheduling in slipstream mode.

We extended the Omni OpenMP compiler to generate binaries that support slipstream mode, and we show the performance of slipstream-enabled OpenMP codes from the NAS Parallel Benchmark suite, running on the SimOS simulator. Our extension to OpenMP allowed the benchmarks to achieve an average performance improvement of 14% with static scheduling. For dynamic scheduling, the performance improvement is 12% on average.

1. Introduction

OpenMP [3] is a directive-based standard for shared-memory parallel programming. It allows simple incremental parallelization of applications by identifying loops and other regions of code that can be computed in parallel. OpenMP does not provide facilities to control data locality or coherence, as these features are platform dependent.
Portability of OpenMP applications puts the burden on compilers and hardware to achieve good performance.

(This work was supported in part by the NSF Computer Systems Architecture program, contract CCR-0105628.)

While a compiler can perform analysis to remove unnecessary synchronization and to optimize for locality of data accesses, the overhead of parallelization versus the performance gain cannot always be determined at compile time. For example, we cannot determine whether parallelizing a certain loop will be worthwhile without knowing the loop iteration count, which may be a runtime variable. Likewise, the upper limit of parallelization for decent performance depends on runtime information, such as the problem size and the underlying architecture. For this reason, the OpenMP standard includes environment variables that facilitate changing decisions about scheduling and parallelism at runtime.

Other researchers have proposed extensions to OpenMP to express architecture-specific optimizations, such as data distribution directives for CC-NUMA [6] and software-DSM [15] systems. Such extensions may inhibit portability, but they can be ignored by systems to which they do not apply. They give the programmer another tool for tuning performance without explicitly modifying the application program. In this spirit, we present the extensions and compiler support needed to exploit slipstream execution mode, a new performance enhancement for multiprocessors built from dual-processor CMPs (chip multiprocessors) [9].

Slipstream execution mode is based on the observation that adding more computational resources does not always reduce execution time for a fixed-size problem. As the problem is divided into smaller pieces to increase parallelism, communication and synchronization overheads begin to dominate or even overtake the reduced computation time.
When this occurs, it may be more effective to apply additional resources to reduce communication overhead, rather than to increase parallelism.

Slipstream execution mode targets cache-coherent distributed shared memory (DSM) multiprocessors built from dual-processor CMPs with a shared L2 cache, such as the IBM Power-4 CMP [10]. A parallel task is allocated on one processor of each CMP node. The other processor of each node executes a reduced version of the same task. The reduced version skips shared-memory stores and synchronization, allowing it to run ahead of the true task. Even with the skipped operations, the reduced task makes accurate for-