Automatic Detection of Large Extended Data-Race-Free Regions with Conflict Isolation

Alexandra Jimborean, Per Ekemark, Jonatan Waern, Stefanos Kaxiras, and Alberto Ros

Abstract—Data-race-free (DRF) parallel programming is becoming a standard, as the newly adopted memory models of mainstream programming languages such as C++ and Java impose data-race-freedom as a requirement. We propose compiler techniques that automatically delineate extended data-race-free (xDRF) regions, namely regions of code that provide the same guarantees as synchronization-free regions (in the context of DRF codes). xDRF regions stretch across synchronization boundaries, function calls, and loop back-edges while preserving the data-race-free semantics, thus increasing the optimization opportunities exposed to the compiler and to the underlying architecture. We further enlarge xDRF regions with a conflict isolation (CI) technique, delineating what we call xDRF-CI regions, which preserve the same properties as xDRF regions. Our compiler (1) precisely analyzes the threads' memory access behavior and data sharing in shared-memory, general-purpose parallel applications, (2) isolates data sharing, and (3) marks the limits of xDRF-CI code regions. The contribution of this work is a simple but effective method that alleviates the drawbacks of the compiler's conservative nature, making it competitive with (and even able to surpass) an expert who delineates xDRF regions manually. We evaluate the potential of our technique by employing xDRF and xDRF-CI region classification in a state-of-the-art, dual-mode cache coherence protocol. We show that xDRF regions reduce the coherence bookkeeping and enable optimizations for performance (6.4%) and energy efficiency (12.2%) compared to a standard directory-based coherence protocol. Enhancing the xDRF analysis with the conflict isolation technique improves performance by 7.1% and energy efficiency by 15.9%.
Index Terms—Compile-time analysis, inter-procedural analysis, inter-thread analysis, data sharing, data races, cache coherence.

• A. Jimborean, P. Ekemark, J. Waern, and S. Kaxiras are with the Department of Information Technology, Uppsala University, 751 05 Uppsala, Sweden. E-mail: alexandra.jimborean@it.uu.se, stefanos.kaxiras@it.uu.se
• A. Ros is with the Computer Engineering Department, University of Murcia, 30100 Murcia, Spain. E-mail: aros@ditec.um.es

1 INTRODUCTION

Parallel programming languages based on the shared-memory model have well-defined memory consistency models that clarify when data modified by one thread must be visible to other threads. To simplify reasoning about the correctness of parallel executions, mainstream languages such as C++ and Java have already adopted data-race-freedom (DRF) as a standard and provide no guarantees, or only weak ones, in the presence of data races. For instance, C and C++ programs that contain data races have undefined semantics [1], [2], [3]. In contrast, data-race-free codes enable a variety of optimizations based on the fundamental observation that different threads cannot access the same memory location without synchronization if at least one thread modifies the target variable. In other words, in DRF applications, synchronization-free regions provide the strong guarantee that different threads cannot concurrently target the same memory address. Leveraging this property, recently proposed micro-architectural enhancements relax unnecessarily restrictive constraints, as shown for example in state-of-the-art coherence protocols [4], [5], [6], [7], [8], [9], [10]. These proposals demonstrate that synchronization-free regions in DRF applications permit the core to delay the action of publishing the writes, as shown in Figure 1.b, leading to significant improvements in performance and energy compared to traditional protocols (Figure 1.a). Similarly, C/C++ compilers and the like typically optimize synchronization-free regions as if the code were sequential, without speculation or costly inter-thread analysis.

[Figure 1: three timelines of the same code, loop 1..N { write write lock ... unlock write }, followed by a join / barrier. Panels: (a) Traditional c.c.; (b) Optimized c.c., N+1 coherence actions; (c) xDRF c.c., 1 coherence action, with the xDRF region formed by DRF1 and DRF2 around the critical section (CS). "visible" markers indicate where writes are published.]

Fig. 1. (a) A standard cache coherence (c.c.) protocol makes the write operations visible immediately after they have executed, thus performing 3×N actions. (b) Coherence protocols designed for DRF applications delay the action of making write operations visible until the first encountered synchronization point, hence N+1 actions. (c) The xDRF region consists of both the DRF1 and DRF2 regions (bypassing CS). An xDRF-aware cache coherence protocol can safely defer the action of publishing writes until the boundary of the xDRF region, thus reducing the number of actions to just one.

In this paper we denote synchronization-free regions that are not guarded by lock-unlock operations as DRF. Extended data-race-free (xDRF) regions are sets of DRF regions that span across synchronization points (e.g., acquire-release pairs) and bypass the synchronized code (i.e., the critical section), while maintaining the DRF semantics [11] across the entire region [12], [13]. For example, in Figure 1.c, the xDRF region consists of the data-race-free regions DRF1 and DRF2, excluding the synchronized code, which we denote as an enclave non-DRF region (CS). In short, xDRF regions enable optimizations across synchro-