Interprocedural Strength Reduction of Critical Sections in Explicitly-Parallel Programs

Rajkishore Barik, Intel Labs, Santa Clara, CA 95054, email: rajkishore.barik@intel.com
Jisheng Zhao, Rice University, Houston, TX 77005, email: jisheng.zhao@rice.edu
Vivek Sarkar, Rice University, Houston, TX 77005, email: vsarkar@rice.edu

Abstract—In this paper, we introduce novel compiler optimization techniques to reduce the number of operations performed in critical sections that occur in explicitly-parallel programs. Specifically, we focus on three code transformations: 1) Partial Strength Reduction (PSR) of critical sections, to replace critical sections by non-critical sections on certain control-flow paths; 2) Critical Load Elimination (CLE), to replace memory accesses within a critical section by accesses to scalar temporaries that contain values loaded outside the critical section; and 3) Non-critical Code Motion (NCM), to hoist thread-local computations out of critical sections. The effectiveness of the first two transformations is further increased by interprocedural analysis. The effectiveness of our techniques has been demonstrated for critical section constructs from three different explicitly-parallel programming models: the isolated construct in Habanero Java (HJ), the synchronized construct in standard Java, and transactions in the Java-based Deuce software transactional memory system. We used two SMP platforms (a 16-core Intel Xeon SMP and a 32-core IBM Power7 SMP) to evaluate our optimizations on 17 explicitly-parallel benchmark programs that span all three models. Our results show that the optimizations introduced in this paper can deliver measurable performance improvements that increase in magnitude when the program is run with a larger number of processor cores.
These results underscore the importance of optimizing critical sections, and the fact that the benefits from such optimizations will continue to increase with increasing numbers of cores in future many-core processors.

Index Terms—Critical sections; transactions; partial strength reduction; critical load elimination; non-critical code motion; interprocedural optimization.

I. INTRODUCTION

It is expected that future computer systems will comprise massively multicore processors with hundreds of cores per chip. Compiling programs for concurrency, scalability, locality, and energy efficiency on these systems is a major challenge. This paper focuses on compiler techniques that address some of the scalability limitations of applications executing on many-core systems. According to Amdahl's law, the speedup of a parallel application is limited by the amount of time spent in its sequential part. A major source of these sequential bottlenecks in parallel applications is critical sections, which are logically executed by at most one thread at a time. Critical sections are most commonly found in explicitly-parallel programs, which can be non-deterministic in general. In contrast, automatically parallelized programs are usually deterministic and use critical sections sparingly (often as a surrogate for more efficient mechanisms, e.g., when a critical section is used to implement a parallel reduction). As the number of threads increases, so does the contention in critical sections. It is thus important for both the performance and the scalability of multi-threaded applications that critical sections be optimized heavily. To this end, recent work [1], [2] has proposed combined software- and hardware-based approaches to accelerate critical sections on multi-core architectures.
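As a concrete illustration of the kind of critical section this paper targets, consider the following hypothetical Java sketch (the Counter and Config classes and their field names are our own illustrative inventions, not taken from the paper's benchmarks). A shared-field load performed inside a synchronized block can, under the right conditions, be hoisted into a scalar temporary outside the block, reducing the work performed while the lock is held:

```java
// Hypothetical sketch; names are illustrative, not from the paper's benchmarks.
class Config {
    int limit;
}

class Counter {
    static final Object lock = new Object();
    static int total = 0;

    // Unoptimized: the load of shared.limit executes while holding the lock,
    // so a cache miss on that load delays every other thread waiting to enter.
    static void addClamped(Config shared, int delta) {
        synchronized (lock) {
            int limit = shared.limit;
            total = Math.min(total + delta, limit);
        }
    }

    // Hand-optimized: the load is hoisted into a scalar temporary before lock
    // acquisition. This is legal only if no other thread can write
    // shared.limit concurrently; establishing that is the analysis obligation
    // a compiler must discharge before applying such a transformation.
    static void addClampedOpt(Config shared, int delta) {
        int limit = shared.limit;
        synchronized (lock) {
            total = Math.min(total + delta, limit);
        }
    }
}
```

This manual rewrite is the effect that the Critical Load Elimination transformation named in the abstract aims to achieve automatically, subject to the compiler proving that no concurrent write to the hoisted field can intervene.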
The focus of our paper is on compiler-based approaches that move operations out of critical sections whenever possible, akin to the attention paid to innermost loops by classical optimizing compilers for sequential programs. We refer to such code transformations as "strength reduction" because the overall cost of an operation is lower when it is performed outside a critical section than within it. As an example, consider memory load operations within a critical section. If such a load results in a cache miss, it can cause performance problems in two ways: the overhead of the miss not only hampers the performance of the thread executing it, but also delays other threads from entering the critical section. It is thus desirable to move cache misses out of critical sections, to whatever extent possible. In addition, moving loads out of critical sections can reduce cache-consistency overheads.

Well-known redundancy elimination techniques that help achieve this goal include scalar replacement for load elimination [3], [4], [5], [6], [7], redundant memory operation analysis [8], and register promotion [9]. For explicitly-parallel programs, most compilers make conservative assumptions when performing load elimination around critical sections. For example, the load elimination algorithm implemented in the Jikes RVM 3.1 dynamic optimizing compiler [10] conservatively assumes that the boundaries of synchronized blocks kill all objects, thereby prohibiting the hoisting of loads on shared objects out of synchronized blocks. A more recent load elimination algorithm [11] eliminates load operations in the context of the async, finish, and isolated/atomic constructs found in the Habanero-Java (HJ) [12] and X10 [13] languages. For critical sections (isolated blocks), this algorithm proposed a flow-insensitive and context-insensitive approach that unifies side-effects for all isolated blocks in the entire application.
This approach only allows a restricted