Colorama: Architectural Support for Data-Centric Synchronization*

Luis Ceze, Pablo Montesinos, Christoph von Praun†, and Josep Torrellas
University of Illinois at Urbana-Champaign
{luisceze, pmontesi, torrellas}@cs.uiuc.edu
http://iacoma.cs.uiuc.edu
†IBM T.J. Watson Research Center
praun@us.ibm.com

ABSTRACT

With the advent of ubiquitous multi-core architectures, a major challenge is to simplify parallel programming. One way to tame one of the main sources of programming complexity, namely synchronization, is transactional memory (TM). However, we argue that TM does not go far enough, since the programmer still needs non-local reasoning to decide where to place transactions in the code. A significant improvement to the art is Data-Centric Synchronization (DCS), where the programmer uses local reasoning to assign synchronization constraints to data. Based on these, the system automatically infers critical sections and inserts synchronization operations.

This paper proposes novel architectural support to make DCS feasible, and describes its programming model and interface. The proposal, called Colorama, needs only modest hardware extensions, supports general-purpose, pointer-based languages such as C/C++ and, in our opinion, can substantially simplify the task of writing new parallel programs.

1. Introduction

As chip multiprocessors become widespread, there is growing pressure to substantially broaden their parallel application base. Unfortunately, the vast majority of current application programmers find parallel programming too complex. To effectively utilize the upcoming hardware, we need major breakthroughs that simplify parallel programming.

Developing a parallel application consists of four steps [15]: decomposing the problem, assigning the work to threads, orchestrating the threads, and mapping them to the machine. Orchestration is arguably the most challenging step, as it involves synchronizing the threads.
It is in this area that innovations to simplify parallel programming are most urgently sought.

One such innovation is Transactional Memory (TM) [1, 7, 10, 16, 18]. In TM, the programmer specifies sequences of operations that should be executed atomically. TM simplifies parallel programming in two ways. First, the programmer does not need to worry about the intricacies of managing locks. Second, he does not need to fine-tune critical sections as much, since concurrency is only limited by dependences, not by critical section length.

We claim, however, that TM is still complicated: it requires the programmer to reason non-locally. Specifically, when the programmer inserts a transaction annotation, he also needs to think about what other parts of the program may be accessing this same or related shared data, and potentially insert transaction annotations there as well. Intuitively, like inserting lock and unlock operations, inserting transaction annotations involves taking a code-centric approach.

* This work was supported in part by the National Science Foundation under grants EIA-0072102, EIA-0103610, CHE-0121357, and CCR-0325603; DARPA under grant NBCH30390004; DOE under grant B347886; and gifts from IBM and Intel. Luis Ceze was supported by an IBM PhD Fellowship.

To improve programmability further, we need a data-centric approach [20]. With Data-Centric Synchronization (DCS), the programmer associates synchronization constraints with the program's data structures. Such constraints indicate which sets of data structures should remain consistent with each other and, therefore, be accessed in the same critical section. From these constraints, the system automatically infers the critical sections and inserts thread synchronization operations in the code. DCS simplifies parallel programming because the programmer reasons locally, focusing only on what structures should be consistent with each other.
Existing DCS proposals [20] take user-provided, data-centric synchronization constraints and decide where to insert critical sections using software-only support. In particular, the compiler needs to analyze all the accesses in the code. This is unrealistic in most C/C++ environments, where pointer aliasing is common and, most importantly, dynamic linking denies the compiler access to the whole program.

To make DCS practical, this paper proposes the first design for Hardware DCS (H-DCS). Our proposal, called Colorama, relies on two hardware primitives: one that monitors all memory accesses to decide when to start a critical section, and one that flexibly triggers the exit of a critical section. Colorama is independent of the underlying synchronization mechanism. In this paper, we present a transaction-based implementation and also discuss the issues that appear in a lock-based implementation.

We describe Colorama's architecture, a simple implementation that extends a Mondrian Memory Protection (MMP) [22] system, its programming model and API, and its capacity to help debug conventional codes. We show that Colorama needs few hardware resources and has small overhead. It supports general-purpose, pointer-based languages such as C/C++ and, in our opinion, can substantially simplify the task of writing new parallel programs.

In the following, Section 2 introduces DCS; Sections 3, 4, 5 and 6 present Colorama's architecture, implementation, programming environment, and debugging issues, respectively; Sections 7 and 8 evaluate Colorama; and Section 9 discusses related work.

2. Data-Centric Synchronization (DCS)

2.1. Basic Idea

In Data-Centric Synchronization (DCS) [20], the programmer associates synchronization constraints with data structures, typically