M. Bubak et al. (Eds.): ICCS 2004, LNCS 3037, pp. 475–482, 2004.
© Springer-Verlag Berlin Heidelberg 2004
An Extended Coherence Protocol for Recoverable DSM
Systems with Causal Consistency*
Jerzy Brzeziński and Michal Szychowiak
Institute of Computing Science
Poznań University of Technology
Piotrowo 3a, 60-965 Poznań, POLAND
phone: +48 61 665 28 09, fax: +48 61 877 15 25
{jbrzezinski,mszychowiak}@cs.put.poznan.pl
Abstract. This paper presents a new checkpoint-recovery protocol for
Distributed Shared Memory (DSM) systems with read-write objects. It is based
on independent checkpointing integrated with a coherence protocol for the
causal consistency model. This integration results in high availability of
shared objects and ensures fast restoration of a consistent state of the DSM
in spite of multiple node failures, while introducing little overhead.
Moreover, in case of network partitioning, the extended protocol ensures that
all the processes in the majority partition of the DSM system can
continuously access all the objects.
1 Introduction
One of the most important issues in designing modern Distributed Shared Memory
(DSM) systems is fault tolerance, namely recovery, aimed at guaranteeing continuous
availability of shared data even in case of failures of some DSM nodes. The recovery
techniques developed for general distributed systems suffer from significant overhead
when imposed on DSM systems (e.g. [3]). This motivates research into new recovery
protocols dedicated to DSM. Our research aims at constructing a new solution to the
DSM recovery problem which would tolerate concurrent failures of multiple nodes or
network partitioning. In [2] we proposed the concept of a coherence protocol for the
causal consistency model [1], extended with low-cost checkpointing, which ensures fast
recovery. To the best of our knowledge, it is the first checkpoint-recovery protocol
for this consistency model. In this paper we present a formal description of the
protocol as well as the proof of its correctness.
This paper is organized as follows. In section 2 we define the system model.
Section 3 details a new coherence protocol extended with checkpointing in order to
offer high availability and fast recovery of shared data. The protocol is proven
correct in section 4. Concluding remarks are given in section 5.
* This work has been partially supported by the State Committee for Scientific Research
grant no. 7T11C 036 21.