Stabilizers: A Safe Lightweight Checkpointing Abstraction for Concurrent Programs

Lukasz Ziarek, Philip Schatz, Suresh Jagannathan
Department of Computer Science
Purdue University
{lziarek,schatzp,suresh}@cs.purdue.edu

Abstract

A checkpoint is a mechanism that allows program execution to be restarted from a previously saved state. Checkpoints can be used in conjunction with exception handling abstractions to recover from exceptional or erroneous events, to support debugging or replay mechanisms, or to facilitate algorithms that rely on speculative evaluation. While relatively straightforward to describe in a sequential setting, for example through the capture and application of continuations, it is less clear how to ascribe a meaningful semantics for safe checkpoints in the presence of concurrency. For a thread to correctly resume execution from a saved checkpoint, it must ensure that all other threads which have witnessed its unwanted effects after the establishment of the checkpoint are also reverted to a meaningful earlier state. If this is not done, data inconsistencies and other undesirable behavior may result. However, automatically determining what constitutes a consistent global state is not straightforward since thread interactions are a dynamic property of the program; requiring applications to specify such states explicitly is not pragmatic.

In this paper, we present a safe and efficient on-the-fly checkpointing mechanism for concurrent programs. We introduce a new linguistic abstraction called stabilizers that permits the specification of per-thread checkpoints and the restoration of globally consistent checkpoints. Global checkpoints are computed through lightweight monitoring of communication events among threads (e.g., message-passing operations or updates to shared variables).
Our implementation results show that the memory and computation overheads for using stabilizers average roughly 4 to 6% on our benchmark suite, leading us to conclude that stabilizers are a viable mechanism for defining restorable state in concurrent programs.

Keywords: Concurrent programming, checkpointing, consistency, rollback, continuations, exception handling, message-passing, shared memory.

1. Introduction

Checkpointing mechanisms allow applications to preserve and restore state. Checkpoints have obvious utility for error recovery [29], program replay and debugging [30, 39]; they can be used to support applications that engage in transactional behavior [18, 20, 41], speculative execution [26] or persistence [12, 34]; and they can be used to build exception handlers that restore memory to a previous state [36]. In functional languages, continuations provide a simple checkpointing facility: defining a checkpoint corresponds to capturing a continuation [38], and restoring a checkpoint corresponds to invoking this continuation. In the presence of references, a sensible checkpoint state would also need to store the current values of these references along with the continuation.

Unfortunately, defining and manipulating checkpoints becomes significantly more complex in the presence of concurrency.
A thread that wishes to establish a checkpoint can simply save its local state, but of course there is no guarantee that the global state of the program will be consistent if control ever reverts to this point. For example, suppose a communication event via message-passing occurs between two threads and the sender subsequently rolls back control to a local checkpointed state established prior to the communication. A spurious unhandled execution of the (re)sent message may result because the receiver has no knowledge that a rollback of the sender has occurred, and thus has no need to expect retransmission of a previously executed message. In general, the problem of computing a sensible checkpoint requires computing the transitive closure of dependencies manifest among threads from the time the checkpoint is established to the time it is invoked.

A simple remedy would require the state of all active threads to be simultaneously recorded whenever any thread establishes a checkpoint. While this solution is sound, it can lead to substantial inefficiencies and complexity. For instance, if a thread wishes to revert its state to a previously established checkpoint, threads that have witnessed its effects after the checkpoint must be unrolled. However, there may be other threads unaffected by the checkpointed thread's actions. A scheme that fails to recognize this distinction would be overly conservative in its treatment of rollback, and would be inefficient in practice, especially if checkpoints are restored often.
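
To make the distinction between affected and unaffected threads concrete, the following is a minimal Python sketch of the transitive-closure computation described above. It is an illustration only, not the stabilizer implementation: the function name `rollback_set`, the `(time, sender, receiver)` event records, and the assumption of distinct global timestamps are all hypothetical choices made for this example. A thread is marked for rollback if it received a message sent by an already affected thread after that thread became affected.

```python
def rollback_set(events, origin, checkpoint_time):
    """Return the threads that must be reverted when `origin` rolls back
    to a checkpoint established at `checkpoint_time`.

    `events` is an iterable of (time, sender, receiver) records; all
    timestamps are assumed to be distinct (a simplifying assumption).
    """
    # Maps each affected thread to the time at which it became affected.
    affected = {origin: checkpoint_time}

    # Effects only propagate forward in time, so a single pass over the
    # events in timestamp order computes the transitive closure.
    for time, sender, receiver in sorted(events):
        sender_is_affected = sender in affected and time > affected[sender]
        if sender_is_affected and receiver not in affected:
            affected[receiver] = time

    return set(affected)


# Hypothetical trace: A sends to B before and after its checkpoint at time 2.
events = [
    (1, "A", "B"),  # before the checkpoint: creates no dependency
    (2.5, "C", "D"),  # C is unaffected at this point
    (3, "A", "B"),  # B witnesses A's post-checkpoint effects
    (4, "B", "C"),  # C transitively depends on A
    (5, "D", "E"),  # D and E never observe A's effects
]
print(sorted(rollback_set(events, "A", 2)))  # prints ['A', 'B', 'C']
```

In this trace only A, B, and C need to be reverted; a scheme that recorded and restored every thread would also unroll D and E unnecessarily.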
Existing checkpoint approaches can be classified into four broad categories: (a) schemes that require applications to provide their own specialized checkpoint and recovery mechanisms [5, 6]; (b) schemes in which the compiler determines where checkpoints can be safely inserted [4]; (c) checkpoint strategies that require operating system or hardware monitoring of thread state [9, 22, 25]; and (d) library implementations that capture and restore state [13]. Checkpointing functionality provided by an application or a library relies on the programmer to define meaningful checkpoints.

2006/1/17