Master Failures in the Precision Time Protocol

Georg Gaderer¹, Stefano Rinaldi², and Nikolaus Kerö³

¹ Research Unit for Integrated Sensor Systems, Austrian Academy of Sciences
² University of Brescia
³ Oregano Systems Design and Consulting GmbH

Abstract – If all clocks within a distributed system share the same notion of time, the application domain can gain several advantages. Among those are the possibility to implement real-time behavior, accurate time stamping, and event detection. However, with the widespread application of clock synchronization, another topic has to be taken into consideration: fault tolerance. The well-known clock synchronization protocol IEEE1588 (Precision Time Protocol, PTP) is based on a master/slave principle, which has one severe disadvantage: the failure of a master automatically requires the election of a new master. The start of a master election is based on a timeout and thus takes a certain time span, during which the clocks are not synchronized and run freely. Moreover, the use of a new master also requires new delay measurements, which prolong the time of uncertainty as well. This paper analyzes the results of such a master failure and proposes democratic master groups instead of hot-standby masters to overcome this problem. It is shown by means of simulation that the proposed solution will not deteriorate the accuracy of the slave clocks in case of a master failure.

Keywords – Fault tolerance, Clock Synchronization, Computer Networks, IEEE1588

INTRODUCTION

Clocks¹ representing the same notion of time have many advantages in distributed systems, the most obvious being the possibility to schedule coordinated actions such as synchronized communication. This can be used to establish real-time behavior, which is compulsory for TDMA schemes. Another application for synchronized clocks is the identification, ordering, and quantization in terms of timing of events in a distributed system.
Again, the applications for this approach are widespread; a well-known example is LAN eXtensions for Instrumentation (LXI) [1,2], where test and measurement devices are synchronized over Ethernet in order to conveniently set up a-posteriori triggering.

Approaches to reach these synchronicity requirements are well studied and usually take advantage of communication networks. Synchronization is done by periodically exchanging messages to align the clocks with respect to each other. Synchronizing clocks this way is common, for example in the Internet-standard Network Time Protocol (NTP) or in the more accurate IEEE1588 [2] Precision Time Protocol (PTP) standard.

¹ The work presented in this paper is partly funded by the European Fund for Regional Development (EFRE).

PTP is based on a master/slave principle in which a previously elected master synchronizes its slaves via multicast messages². However, for a considerable number of applications even a temporary failure of the clock synchronization is by no means acceptable. PTP handles recovery from a master failure by means of the so-called best master clock (BMC) algorithm; during this phase, all slaves within a synchronization (or multicast) domain run with free-running, unsynchronized clocks while a new master is elected. This paper proposes an approach in which multiple masters are tied together into a so-called master group, in which one or more masters may fail without any of the nodes noticing the failure, so that the synchronization accuracy does not deteriorate.

The remainder of this paper is structured as follows: after an analysis of the state of the art, namely the master election process in IEEE1588, the approach to synchronize within the group is elaborated and the proposed system is demonstrated in a simulation experiment. Finally, a conclusion rounds off the paper and gives an outlook on future research.
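
The impact of the free-running phase mentioned in the introduction can be estimated with a simple drift calculation. The following sketch is an illustration only and not part of the original analysis: the function names and all numeric values (timeout, election time, delay measurement time, oscillator drift) are hypothetical assumptions and are taken neither from this paper nor from IEEE1588.

```python
def outage_duration(timeout_s: float, election_s: float, delay_meas_s: float) -> float:
    """Total time a slave stays unsynchronized after a master failure.

    Sums the three phases named in the text: detecting the failure by
    timeout, electing a new master, and repeating the delay measurement.
    """
    return timeout_s + election_s + delay_meas_s


def free_run_error(drift_ppm: float, free_run_s: float) -> float:
    """Worst-case offset in seconds accumulated by a free-running clock.

    Assumes a constant frequency error of drift_ppm parts per million.
    """
    return drift_ppm * 1e-6 * free_run_s


# Hypothetical example values (assumptions, not measured or specified data).
outage = outage_duration(timeout_s=20.0, election_s=4.0, delay_meas_s=6.0)
error = free_run_error(drift_ppm=50.0, free_run_s=outage)

print(f"unsynchronized for {outage:.1f} s")
print(f"worst-case offset: {error * 1e6:.0f} us")
```

Under these assumed values, a 50 ppm oscillator left free-running for 30 s can drift by up to 1.5 ms, which is why shortening or avoiding the unsynchronized phase matters for applications with tight accuracy requirements.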
STATE OF THE ART

State-of-the-art clock synchronization techniques can use two different paradigms: the master/slave-based principle and the democratic approach. The first method elects one dedicated master in order to synchronize all other nodes. In contrast, democratic algorithms use the clocks of several nodes, which are then combined into an agreed clock value. Both approaches have advantages and disadvantages: master/slave-based clock synchronization is easy to implement and to debug, and the dependencies within an environment running such a protocol are simple. Democratic approaches, on the other hand, add a certain degree of complexity to a system but have the advantage that they can offer fault tolerance: faulty or malfunctioning clocks can be sorted out without sacrificing even short-term accuracy.

A. IEEE 1588

The basics of IEEE1588-2002 and version 2008 are specified in the respective standard document [2]. Secondary literature giving an overview is available as well [3].

² Version 2008 of the IEEE1588 standard (approved at the time of submission of this paper) allows synchronization via unicast messages as well.

ISPCS 2008 – International IEEE Symposium on Precision Clock Synchronization for Measurement, Control and Communication, Ann Arbor, Michigan, September 22–26, 2008. 978-1-4244-2275-3/08/$25.00 ©2008 IEEE

As this paper focuses on the fault case that a