1. Introduction and motivation

A straightforward approach to recovery from failures in a distributed system is to restore its state from a globally consistent checkpoint. The cost of taking a coordinated checkpoint can be amortized if the checkpoint is taken during system- or application-induced global synchronization operations. Typical Distributed Shared Memory (DSM) systems like TreadMarks [4] need to periodically run a garbage collector to reclaim memory used by the DSM protocol data structures. Existing fault-tolerant implementations of DSM systems [2] rely on this kind of global synchronization, implicit in the operation of the system, to take coordinated checkpoints. Any previous checkpoints and logs on stable storage can then be safely discarded, so the system does not have to manage checkpointed state.

Our primary goal is to investigate how to implement fault-tolerant DSM using independent checkpointing. Some DSM protocols, such as the Home-Based Lazy Release Consistency protocol (HLRC) [1], do not enforce global operations, and artificially introducing them to support fault tolerance would hurt performance. In addition, with uncoordinated checkpointing individual processes can decide when and what to checkpoint. Finally, taking a coordinated checkpoint may not always be possible, for example in applications that tolerate weakly connected operation.

2. Background

A major problem in distributed rollback recovery [3] is enforcing a globally consistent cut in the state of individual processes after a failure and restart. One technique is to use coordinated checkpointing to establish a consistent recovery line. The alternative, taking independent checkpoints, suffers from the domino effect, in which a single failure may force all processes to roll back. One advantage of independent checkpointing is that it enables optimizations such as memory exclusion [5].
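To make the domino effect concrete, the following minimal sketch computes a recovery line by rollback propagation over independently taken checkpoints. It is an illustration only, not a procedure taken from the cited systems: the function name `recovery_line`, the representation of checkpoints and messages as per-process local times, and the example trace are all assumptions.

```python
import math

def recovery_line(checkpoints, messages, failed):
    """Hypothetical sketch of rollback propagation.

    checkpoints: dict mapping process -> sorted list of local times at which
                 it checkpointed; every list must contain 0 (initial state).
    messages:    list of (sender, send_time, receiver, recv_time), with times
                 local to each process and strictly greater than 0.
    failed:      the process that crashed.
    Returns a dict mapping process -> local time it must restart from
    (math.inf means the process keeps its current state).
    """
    cut = {p: math.inf for p in checkpoints}
    cut[failed] = checkpoints[failed][-1]   # failed process restarts from its last checkpoint
    changed = True
    while changed:
        changed = False
        for sender, t_send, receiver, t_recv in messages:
            # Orphan message: the send was undone but the receive is still in the cut.
            if t_send > cut[sender] and t_recv <= cut[receiver]:
                # Roll the receiver back to its latest checkpoint before the receive.
                cut[receiver] = max(c for c in checkpoints[receiver] if c < t_recv)
                changed = True
    return cut

# Assumed example trace: two processes whose messages interleave with checkpoints.
checkpoints = {"P": [0, 3, 7], "Q": [0, 2, 6]}
messages = [
    ("P", 8, "Q", 5),   # sent after P's last checkpoint, received before Q's last
    ("Q", 3, "P", 6),
    ("P", 4, "Q", 1),
]
line = recovery_line(checkpoints, messages, failed="P")
print(line)   # {'P': 3, 'Q': 0}
```

In this trace, the failure of P alone forces Q back to its initial state and P back past its most recent checkpoint, even though Q never failed. Each rollback undoes a send whose receive then becomes an orphan.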
An orthogonal technique is to combine checkpoints with logging of the messages received by a process. During re-execution, messages are replayed in the order in which they were received. Such log-based protocols can ensure that the maximum recoverable state is exactly the state of the system before the crash. Pessimistic logging of all messages has this desirable property and does not suffer from the well-known problems of uncoordinated checkpointing. The downside is the high overhead incurred by logging every message.

In a DSM, updates to memory pages are encoded as diffs. A diff represents the differences between a reference version of a page and a version modified by the writes performed during a given interval, and is labeled with the logical time at which it was generated. Pages are versioned by a vector timestamp that records, for each process, the last diff from that process that has been applied.

3. Our approach

We want to build a DSM system that can recover from single node failures while retaining the advantages of uncoordinated checkpointing and fast log-based recovery. Because every memory access may potentially result in an inter-process interaction, logging overhead is critical. The single node failure assumption allows the use of efficient sender-based logging in volatile memory. Additionally, we aggressively use protocol knowledge to reduce overhead by logging protocol messages selectively. For example, in a home-based DSM [1] full page transfers need not be logged. Instead, we log messages carrying diffs and checkpoint pages at their home nodes.

The recovery procedure starts from a checkpointed reference version of a page and incrementally applies logged diffs to it, locally constructing successive versions of the page. This procedure is applied both for restoring a page at a home node and for the local replay of accesses during recovery by a non-home node.
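The page recovery procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Diff`, `Page`, `make_diff` and `recover_page` are hypothetical, diffs are modeled as byte-offset maps, the diff log is assumed to be stored in an order in which the diffs may validly be applied, and the target version is assumed to be given as a vector timestamp.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Diff:
    writer: int      # id of the process that produced the diff
    interval: int    # writer's logical time when the diff was generated
    changes: dict    # byte offset -> new byte value

def make_diff(writer, interval, reference, modified):
    """Encode the bytes that differ between a reference and a modified page."""
    changes = {i: b for i, (a, b) in enumerate(zip(reference, modified)) if a != b}
    return Diff(writer, interval, changes)

class Page:
    def __init__(self, data, version):
        self.data = bytearray(data)
        # Vector timestamp: last interval applied from each process.
        self.version = list(version)

    def apply(self, diff):
        for offset, value in diff.changes.items():
            self.data[offset] = value
        self.version[diff.writer] = max(self.version[diff.writer], diff.interval)

def recover_page(checkpoint, diff_log, target):
    """Rebuild a page version from a checkpoint plus logged diffs.

    Applies, in log order (assumed valid), every logged diff that is newer
    than the checkpoint but not newer than the target vector timestamp.
    """
    page = Page(checkpoint.data, checkpoint.version)
    for d in diff_log:
        if checkpoint.version[d.writer] < d.interval <= target[d.writer]:
            page.apply(d)
    return page

# Assumed example: a 4-byte page shared by two processes (ids 0 and 1).
ckpt = Page(b"aaaa", [0, 0])
d1 = make_diff(0, 1, b"aaaa", b"abaa")   # process 0, interval 1
d2 = make_diff(1, 1, b"abaa", b"abca")   # process 1, interval 1
d3 = make_diff(0, 2, b"abca", b"xbca")   # process 0, interval 2
log = [d1, d2, d3]

older = recover_page(ckpt, log, target=[1, 1])
newest = recover_page(ckpt, log, target=[2, 1])
print(bytes(older.data), older.version)     # b'abca' [1, 1]
print(bytes(newest.data), newest.version)   # b'xbca' [2, 1]
```

Passing a target timestamp lets the same routine serve both uses: a home node would rebuild the latest version, while a replaying node could stop at an earlier version that is sufficient for the access being replayed.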
For replay at a non-home node, we dynamically reconstruct a minimal LRC-consistent version of the page, rather than the actual page that the process received during failure-free execution.

4. Checkpoint and log management

In log-based protocols with independent checkpointing the volatile logs must be saved to stable storage and there

Fault Tolerance with Independent Checkpointing in Distributed Shared Memory
Florin Sultan, Liviu Iftode
Department of Computer Science, Rutgers University, New Jersey
{sultan, iftode}@cs.rutgers.edu
© 1999, Florin Sultan