HARP: Adaptive Abort Recurrence Prediction for Hardware Transactional Memory

Adrià Armejach∗† Anurag Negi Adrián Cristal∗§ Osman Unsal Per Stenström Tim Harris¹

Barcelona Supercomputing Center; Chalmers University of Technology; Oracle Labs, Cambridge; Universitat Politècnica de Catalunya; §IIIA - CSIC - Spanish National Research Council

Abstract—Hardware Transactional Memory (HTM) exposes parallelism by allowing possibly conflicting sections of code, called transactions, to execute concurrently in multithreaded applications. However, conflicts among concurrent transactions result in wasted computation and expensive rollbacks. Under high contention, HTM protocol overheads can, in many cases, amount to several times the useful work done. Blindly scheduling transactions in the presence of contention is therefore clearly suboptimal from a resource utilization standpoint, especially in situations where several scheduling options exist. This paper presents HARP (Hardware Abort Recurrence Predictor), a hardware-only mechanism to avoid speculation when it is likely to fail. Inspired by branch prediction strategies and prior work on contention management and scheduling in HTM, HARP uses the past behavior of transactions and locality in conflicting memory references to accurately predict conflicts. The prediction mechanism adapts to variations in workload characteristics and enables better utilization of computational resources. We show that an HTM protocol that integrates HARP exhibits reductions in both wasted execution time and serialization overheads when compared to prior work, leading to a significant increase in throughput (~30%) in both single-application and multi-application scenarios.

I. INTRODUCTION

The problem of extracting thread-level parallelism through speculative execution has received a lot of attention from both industry and academia [13, 18].
In particular, Hardware Transactional Memory (HTM) [14] offers performance comparable to fine-grained locks while simultaneously enhancing programmer productivity by largely eliminating the burden of managing access to shared data. Recent usability studies support this claim [8, 19], suggesting that Transactional Memory (TM) can be an important tool for building parallel applications. For these reasons, HTM is receiving increasing attention from industry [9, 10, 11], and IBM has released its first chip with built-in HTM support, the BlueGene/Q [23]. More recently, Intel has published ISA extensions (TSX) that provide support for basic HTM and lock elision, with the intention of supporting them in upcoming products [16].

An HTM system allows concurrent speculative execution of blocks of code, called transactions, that may access and update shared data. However, in the presence of data conflicts, transactions may abort, i.e., the results of speculative execution are discarded. This results in wasted work, expensive rollbacks of application state, and inefficient utilization of computational resources. While conflicts due to concurrent accesses to shared data cannot be completely eliminated, mechanisms that avoid starting a transaction when it is likely to fail are necessary for maximizing computational throughput. Moreover, in scenarios where multiple scheduling options are available, such mechanisms can expose additional parallelism and improve resource utilization.

While single-application performance is still important, systems where multiple parallel applications coexist are expected to become increasingly common in the near future. The performance of HTM in scenarios with abundant transactional threads is still an open question, and solutions that provide efficient utilization of computational resources and good performance are required for TM to gain wide acceptance.

¹ Work done while at Microsoft Research, Cambridge.
In the past, considerable work has been done on contention management, but mostly in the field of Software TM (STM) [1, 12, 20]. These proposals typically react after aborts happen, without trying to avoid future conflicts. Conversely, a few HTM proposals exist that try to avoid executing possibly conflicting transactions [3, 5, 24]. However, these solutions do not provide full hardware support and rely on expensive, specialized software runtime routines and data structures. Moreover, the efficacy of these proposals in scenarios with multiple concurrently executing applications is unclear.

In this paper, we introduce the Hardware Abort Recurrence Predictor (HARP), a comprehensive hardware proposal that identifies groups of transactions that are likely to execute concurrently without conflicts. Our proposal allows other threads or applications to utilize computational resources when the expected duration of contention is long, providing better throughput when running several applications, and potentially higher parallelism when several threads of the same application are available for scheduling. Moreover, HARP dynamically chooses a contention avoidance mechanism based on the expected duration of contention, in order to maximize resource utilization while minimizing the amount of work wasted on transaction aborts. HARP avoids software overheads by using simple hardware structures to record transactional characteristics. More specifically, we observe strong temporal locality in contended addresses in transactional applications. By detecting when the conflicting locations change, we can identify when contention is likely to dissipate.

To evaluate HARP, we compare it against "Bloom Filter Guided Transaction Scheduling" (BFGTS) [3], a state-of-the-art transaction scheduling technique, and LogTM [17], a well-established HTM design.
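The intuition in the paragraph above (temporal locality in contended addresses, and a change in the conflicting location as a signal that contention may be dissipating) can be illustrated with a toy software model. The sketch below is hypothetical and is not HARP's actual hardware design, which this section does not specify. It assumes a table indexed by a static transaction identifier (`tx_id`), a per-entry record of the last conflicting address, and a two-bit saturating counter of the kind used in classic branch predictors. The mapping from counter values to three actions (run speculatively, stall briefly, or yield the core to another thread) is likewise an assumption, chosen only to mirror the idea of selecting an avoidance mechanism by expected contention duration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Action(Enum):
    RUN = "run"      # start the transaction speculatively
    STALL = "stall"  # short contention expected: wait, then retry (assumed policy)
    YIELD = "yield"  # long contention expected: give the core to another thread (assumed policy)


@dataclass
class Entry:
    # Hypothetical per-transaction state; not HARP's real table layout.
    last_conflict_addr: Optional[int] = None
    counter: int = 0  # two-bit saturating counter, range 0..3


class AbortPredictor:
    """Toy abort-recurrence predictor (illustrative assumption, not HARP itself)."""

    MAX = 3  # saturation value of a two-bit counter

    def __init__(self) -> None:
        self.table: Dict[int, Entry] = {}

    def _entry(self, tx_id: int) -> Entry:
        # Allocate an empty entry the first time a transaction is seen.
        return self.table.setdefault(tx_id, Entry())

    def on_abort(self, tx_id: int, conflict_addr: int) -> None:
        e = self._entry(tx_id)
        if e.last_conflict_addr == conflict_addr:
            # Same hot address again (temporal locality): raise confidence.
            e.counter = min(e.counter + 1, self.MAX)
        else:
            # Conflicting location changed: the old contention may be dissipating,
            # so record the new address and restart with low confidence.
            e.last_conflict_addr = conflict_addr
            e.counter = 1

    def on_commit(self, tx_id: int) -> None:
        # A successful commit weakens the abort prediction.
        e = self._entry(tx_id)
        e.counter = max(e.counter - 1, 0)

    def predict(self, tx_id: int) -> Action:
        c = self._entry(tx_id).counter
        if c <= 1:
            return Action.RUN
        if c == 2:
            return Action.STALL
        return Action.YIELD
```

In this model, repeated aborts on the same address escalate the response from RUN to STALL to YIELD, while a single abort on a new address drops confidence back to RUN, reflecting the assumption that the previous hot spot has moved on. A hardware predictor would additionally need a bounded table with a replacement policy, which the sketch omits.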
Our evaluation includes single-application setups, comprising a scenario with the same number of threads as cores and a scenario with more threads than cores. We provide insights on when using more threads can extract additional parallelism, and show that HARP outperforms LogTM and BFGTS on average by 109.7% and 30.5%, respectively. Moreover, we are the first to study the performance implications of a transactional multi-application setup where, again, our technique outperforms the other evaluated

© 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI 10.1109/HiPC.2013.6799100