M. Bubak et al. (Eds.): ICCS 2004, LNCS 3037, pp. 475–482, 2004.
© Springer-Verlag Berlin Heidelberg 2004

An Extended Coherence Protocol for Recoverable DSM Systems with Causal Consistency *

Jerzy Brzeziński and Michal Szychowiak

Institute of Computing Science
Poznań University of Technology
Piotrowo 3a, 60-965 Poznań, POLAND
phone: +48 61 665 28 09, fax: +48 61 877 15 25
{jbrzezinski,mszychowiak}@cs.put.poznan.pl

Abstract. This paper presents a new checkpoint recovery protocol for Distributed Shared Memory (DSM) systems with read-write objects. It is based on independent checkpointing integrated with a coherence protocol for the causal consistency model. This integration results in high availability of shared objects and ensures fast restoration of a consistent state of the DSM in spite of multiple node failures, while introducing little overhead. Moreover, in case of network partitioning, the extended protocol ensures that all the processes in the majority partition of the DSM system can continuously access all the objects.

1 Introduction

One of the most important issues in designing modern Distributed Shared Memory (DSM) systems is fault tolerance, namely recovery, aimed at guaranteeing continuous availability of shared data even in case of failures of some DSM nodes. The recovery techniques developed for general distributed systems suffer from significant overhead when imposed on DSM systems (e.g. [3]). This motivates the investigation of new recovery protocols dedicated to DSM. Our research aims at constructing a new solution to the DSM recovery problem which would tolerate concurrent failures of multiple nodes or network partitioning. In [2] we have proposed the concept of a coherence protocol for the causal consistency model [1], extended with low-cost checkpointing which ensures fast recovery. To the best of our knowledge, it is the first checkpoint-recovery protocol for this consistency model.
In this paper we present a formal description of the protocol as well as the proof of its correctness.

This paper is organized as follows. In section 2 we define the system model. Section 3 details a new coherence protocol extended with checkpointing in order to offer high availability and fast recovery of shared data. The protocol is proven correct in section 4. Concluding remarks are given in section 5.

* This work has been partially supported by the State Committee for Scientific Research grant no. 7T11C 036 21