1. Introduction and motivation
A straightforward approach to recovery from failures
in a distributed system is to restore its state from a
globally consistent checkpoint. The cost of taking a
coordinated checkpoint can be amortized if the checkpoint
is taken during system- or application-induced global
synchronization operations. Typical Distributed Shared
Memory (DSM) systems like TreadMarks [4] need to
periodically run a garbage collector to reclaim memory
used by the DSM protocol data structures. Existing
fault-tolerant implementations of DSM systems [2] rely on
global synchronization implicit in the operation of the
system, which is used to take coordinated checkpoints.
Any previous checkpoints and logs on stable storage can
thus be safely discarded and the system does not have to
face the problem of management of checkpointed state.
Our primary goal is to investigate how to implement
fault-tolerant DSM using independent checkpointing.
There exist DSM protocols like the Home-Based Lazy
Release Consistency protocol (HLRC) [1] that do not
enforce global operations, and artificially introducing
them solely to support fault tolerance would degrade
performance. In addition, with uncoordinated
checkpointing, individual processes can decide when
and what to checkpoint. Finally, taking a coordinated
checkpoint may not always be possible, for example in
applications that tolerate weakly connected operation.
2. Background
A major problem in distributed rollback recovery [3] is
enforcing a globally consistent cut in the state of
individual processes after a failure and restart. One
technique is to use coordinated checkpointing to establish
a consistent recovery line. The alternative, taking
independent checkpoints, suffers from the domino effect,
in which a single failure may force all processes to
roll back. Among the advantages of independent
checkpointing is that it enables optimizations like memory
exclusion [5]. An orthogonal technique is to combine
checkpoints with a log of the messages received by a
process. During re-execution, messages are replayed in the
order they were received. Such log-based protocols can
ensure that the maximum recoverable state is exactly the
state of the system before the crash. Pessimistic logging
of all messages has this desirable property, while not
suffering from the well-known problems of uncoordinated
checkpointing.
The downside is the high overhead incurred by logging
every message.
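The pessimistic, receiver-side variant described above can be illustrated with a minimal sketch. It is not taken from any particular system: the class name LoggedProcess, the file-based log, and the toy state update are assumptions made only for illustration.

```python
import json
import os
import tempfile


class LoggedProcess:
    """Toy process (hypothetical) whose state depends on message order."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = 0

    def deliver(self, message):
        # Pessimistic logging: force the message to stable storage
        # before it is allowed to affect the process state.
        with open(self.log_path, "a") as log:
            log.write(json.dumps(message) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self._apply(message)

    def _apply(self, message):
        # Deliberately order-sensitive, so replay order matters.
        self.state = self.state * 2 + message

    @classmethod
    def recover(cls, log_path, checkpoint_state=0):
        # Restart from the checkpoint and replay the logged
        # messages in the order they were originally received.
        proc = cls(log_path)
        proc.state = checkpoint_state
        with open(log_path) as log:
            for line in log:
                proc._apply(json.loads(line))
        return proc


log_path = os.path.join(tempfile.mkdtemp(), "recv.log")
proc = LoggedProcess(log_path)
for msg in (3, 1, 4):
    proc.deliver(msg)
restored = LoggedProcess.recover(log_path)
print(proc.state, restored.state)
```

The synchronous write on every delivery is what makes the recovered state match the pre-crash state, and it is also the source of the overhead noted above.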
In a DSM, updates to memory pages are encoded as
diffs. A diff represents the differences between a
reference version of a page and a version modified by
writes performed during a given interval, and is labeled
with the logical time at which it was generated. Pages are
versioned by a vector timestamp that records, for each
process, the last diff from that process that was applied.
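As an illustration of these two notions, the following minimal sketch computes a byte-level diff between a reference copy and a modified copy of a page, and tracks a per-process version vector as diffs are applied. The names make_diff, apply_diff and Page, and the dictionary encoding of a diff, are assumptions chosen for brevity rather than the encoding used by any actual DSM.

```python
def make_diff(reference, modified):
    """Return {offset: new_byte} for every byte that differs."""
    assert len(reference) == len(modified)
    return {
        i: new
        for i, (old, new) in enumerate(zip(reference, modified))
        if old != new
    }


def apply_diff(data, diff):
    """Patch a mutable page buffer in place with one diff."""
    for offset, value in diff.items():
        data[offset] = value


class Page:
    """Page contents plus a vector timestamp (hypothetical layout)."""

    def __init__(self, size, nprocs):
        self.data = bytearray(size)
        # version[p] is the latest interval of process p whose
        # diff has been applied to this copy of the page.
        self.version = [0] * nprocs

    def apply(self, writer, interval, diff):
        apply_diff(self.data, diff)
        self.version[writer] = max(self.version[writer], interval)
```

For instance, applying a diff produced by process 1 in its interval 3 leaves entry 1 of the page's version vector at 3 and all other entries unchanged.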
3. Our approach
We want to build a DSM system that can recover from
single-node failures and retain the advantages of
uncoordinated checkpointing and fast log-based recovery.
Because every memory access may result in an
inter-process interaction, logging overhead is critical. The
single-node failure assumption allows efficient
sender-based logging in volatile memory, since the logs
held by surviving senders remain available to a failed
receiver. Additionally,
we aggressively use protocol knowledge to reduce
overhead by selectively logging protocol messages.
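A minimal sketch of sender-based logging with a selective-logging filter is given below. The Sender class, the list standing in for the network, the "kind" field of a message, and the should_log predicate are all hypothetical names introduced for illustration; the predicate stands in for whatever protocol knowledge decides which messages are worth logging.

```python
from collections import defaultdict


class Sender:
    """Keeps a volatile, per-destination log of selected messages."""

    def __init__(self, pid, should_log):
        self.pid = pid
        self.should_log = should_log       # protocol-specific filter
        self.log = defaultdict(list)       # dest -> [(seq, message)]
        self.next_seq = defaultdict(int)   # dest -> next sequence number

    def send(self, dest, message, network):
        seq = self.next_seq[dest]
        self.next_seq[dest] += 1
        if self.should_log(message):
            # The log lives in the sender's memory only; it is
            # lost if the sender itself fails.
            self.log[dest].append((seq, message))
        network.append((self.pid, dest, seq, message))

    def resend_to(self, dest, network):
        """Replay logged messages to a destination that restarted."""
        for seq, message in self.log[dest]:
            network.append((self.pid, dest, seq, message))
```

Under the single-failure assumption, a crashed receiver can ask every surviving process to call resend_to on its behalf, so no message has to reach stable storage on the critical path.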
For example, in a home-based DSM [1], full page
transfers need not be logged. Instead, we log messages
carrying diffs, and we checkpoint pages at their home nodes.
The recovery procedure uses a checkpointed reference
version of a page to which it incrementally applies diff
logs to locally construct evolving versions of the page.
This procedure is applied both for restoring a page at a
home node and for the local replay of accesses during
recovery by a non-home node. In the second case, we
dynamically reconstruct a minimal LRC-consistent
version of the page, instead of the actual page that was
received by the process during the failure-free execution.
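The reconstruction step can be sketched as follows, under simplifying assumptions: a diff is a mapping from byte offset to new value, the diff log is already ordered consistently with the happens-before relation, and the version a replayed access needs is given as a vector timestamp named required. The function name rebuild_page and its signature are hypothetical.

```python
def rebuild_page(checkpoint, checkpoint_version, diff_log, required):
    """Rebuild a page from its checkpoint and a log of diffs.

    checkpoint          bytes of the checkpointed reference version
    checkpoint_version  vector timestamp of that checkpoint
    diff_log            list of (writer, interval, diff) tuples
    required            vector timestamp the access must observe
    """
    page = bytearray(checkpoint)
    version = list(checkpoint_version)
    for writer, interval, diff in diff_log:
        # Apply only diffs that are newer than what the page
        # already reflects and not newer than what is required,
        # which yields a minimal version satisfying `required`.
        if version[writer] < interval <= required[writer]:
            for offset, value in diff.items():
                page[offset] = value
            version[writer] = interval
    return page, version
```

Calling the function with required set to the latest logged intervals corresponds to restoring the page at its home node, while a smaller required vector corresponds to the replay of an access at a non-home node, where later diffs are deliberately left out.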
4. Checkpoint and log management
In log-based protocols with independent checkpointing
the volatile logs must be saved to stable storage and there
Fault Tolerance with Independent Checkpointing in Distributed Shared Memory
Florin Sultan, Liviu Iftode
Department of Computer Science, Rutgers University, New Jersey
{sultan, iftode}@cs.rutgers.edu
© 1999, Florin Sultan