Algorithmica (1997) 18: 486–511
© 1997 Springer-Verlag New York Inc.

Wait-Free Clock Synchronization¹

S. Dolev² and J. L. Welch²

¹ This work was supported by NSF Presidential Young Investigator Award CCR-9396098 and Texas A&M University Engineering Excellence funds. A preliminary version of this work was presented at the 12th ACM Symposium on Principles of Distributed Computing, August 1993 [DW].
² Department of Computer Science, Texas A&M University, College Station, TX 77843, USA. {shlomi, welch}@cs.tamu.edu.

Received December 20, 1993; revised January 1995. Communicated by G. N. Frederickson.

Abstract. Multiprocessor computer systems are becoming increasingly important as vehicles for solving computationally expensive problems. Synchronization among the processors is achieved with a variety of clock configurations. A new notion of fault-tolerance for clock synchronization algorithms is defined, tailored to the requirements and failure patterns of shared memory multiprocessors. Algorithms in this class can tolerate any number of napping processors, where a napping processor can fail by repeatedly ceasing operation for an arbitrary time interval and then resume operation without necessarily recognizing that a fault has occurred. These algorithms guarantee that, for some fixed k, once a processor P has been working correctly for at least k time, then as long as P continues to work correctly, (1) P does not adjust its clock, and (2) P’s clock agrees with the clock of every other processor that has also been working correctly for at least k time. Because a working processor must synchronize in a fixed amount of time regardless of the actions of the other processors, these algorithms are called wait-free. Another useful type of fault-tolerance is called self-stabilization: starting with an arbitrary state of the system, a self-stabilizing algorithm eventually reaches a point after which it correctly performs its task. Two wait-free clock synchronization algorithms are presented for a model with global clock pulses. The first one is self-stabilizing; the second one is not but it converges more quickly than the first one. The self-stabilizing algorithm requires each processor’s communication register contents to be a part of the processor’s state. This last requirement is proven necessary. A wait-free clock synchronization algorithm is also presented for a model with local clock pulses. This algorithm is not self-stabilizing.

Key Words. Distributed computing, Algorithms, Wait-free, Self-stabilization, Clock synchronization.

1. Introduction. Multiprocessor computers are being designed with ever-increasing numbers of processors. These multiprocessors can be used to solve problems that demand high computation power, such as grand challenge computing problems, which previously were not efficiently solvable. However, in order to take full advantage of multiprocessors, it is vital that they be made fault-tolerant. Fault-tolerance is necessary in order to provide even the same level of availability that is provided by uniprocessors, since the probability of a crash in a multiprocessor system increases with the number of processors. Clever fault-tolerance schemes may also be able to provide a higher level of availability, by continuing ongoing computations even if a large number of processors fail.

A central issue for any multiprocessor system is the synchronization among processors. The common synchronization component used in multiprocessors is a clock. There are several ways to implement a clock in multiprocessor systems: (1) provide a common clock that is connected to all the processors in the system, (2) provide a common