4 IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 62, NO. 1, JANUARY 2013

Master Failure Detection Protocol in Internal Synchronization Environment

Andrea Bondavalli, Member, IEEE, Francesco Brancati, Alessandra Flammini, Senior Member, IEEE, and Stefano Rinaldi, Member, IEEE

Abstract—During the last decades, wide advances in networking technologies have allowed the development of distributed monitoring and control systems. These systems show advantages compared with centralized solutions: heterogeneous nodes can be easily integrated, new nodes can be easily added to the system, and there is no single point of failure. For these reasons, distributed systems have been adopted in different fields, such as industrial automation and telecommunication systems. Recently, due to technology improvements, distributed systems have also been adopted in the control of power-grid and transport systems, i.e., the so-called large-scale complex critical infrastructures. Given the strict safety, security, reliability, and real-time requirements, using distributed systems for controlling such critical infrastructures demands adequate mechanisms for sharing the same notion of time among the nodes. For this class of systems, a synchronization protocol, such as the IEEE 1588 standard, can be adopted. This type of synchronization protocol was designed to achieve very precise clock synchronization, but it may not be sufficient to ensure the safety of the entire system. For example, instability of the local oscillator of a reference node, due to a failure of the node itself or to malicious attacks, could influence the quality of synchronization of all nodes. In recent years, a new software clock, the reliable and self-aware clock (R&SAClock), which is designed to estimate the quality of synchronization through statistical analysis, was developed and tested.
This statistical instrument can be used to identify any anomalous conditions with respect to normal behavior. A careful analysis and classification of the main points of failure of the IEEE 1588 standard suggests that the reference node, which is called the master, is the weak point of the system. For this reason, this paper deals with the detection of faults of the reference node(s) of an IEEE 1588 setup. This paper describes and evaluates the design of a protocol for timing failure detection for internal synchronization based on a revised version of the R&SAClock software, suitably modified to cross-exploit the information on the quality of synchronization among all the nodes of the system. The experimental evaluation of this approach confirms the capability of the synchronization uncertainty, which is provided by R&SAClock, to reveal the anomalous behaviors either of the local node or of the reference node. In fact, it is shown that, through a proper configuration of the parameters of the protocol, the system is able to detect all the failures injected on the master in different experimental conditions and to correctly identify failures on slaves with a probability of 87%.

Index Terms—Failure detection, precision time protocol (PTP), reliable and self-aware clock (R&SAClock), synchronization uncertainty, synchronization.

Manuscript received November 29, 2011; revised March 16, 2012; accepted March 17, 2012. Date of publication August 15, 2012; date of current version December 12, 2012. This work was supported in part by the Italian Ministry for Education, University, and Research in the framework of the Project of National Research Interest (PRIN) “DOTS-LCCI: Dependable Off-The-Shelf based middleware systems for Large-scale Complex Critical Infrastructures” (dotslcci.prin.dis.unina.it) 2008 and of the PRIN “Methods and tools for the time measurement and for the time synchronization in wireless sensor networks,” N. 2008TK5B55_003. The Associate Editor coordinating the review process for this paper was Dr. Dario Petri.
A. Bondavalli and F. Brancati are with the Department of Systems and Informatics, University of Florence, 50134, Firenze, Italy (e-mail: bondavalli@unifi.it; francesco.brancati@unifi.it).
A. Flammini and S. Rinaldi are with the Department of Information Engineering, University of Brescia, 25123 Brescia, Italy (e-mail: alessandra.flammini@ing.unibs.it; stefano.rinaldi@ing.unibs.it).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIM.2012.2209916

I. INTRODUCTION

THE WIDE diffusion of distributed systems in different fields, such as industrial automation and telecommunication, and their increasingly wide use in large-scale complex critical infrastructures (LCCIs) [1], such as transport or power grids [2], plays a key role in several fundamental human activities. This type of system requires not only an accurate synchronization of the nodes, in order to assure an adequate quality of the services (QoS) offered, but also suitable algorithms that are able to take runtime decisions on the basis of the actual and past behavior of the system. Such algorithms must detect relevant anomalies even in systems that exhibit variable and nonstationary behavior and that may be affected by perturbations [3].

Generally speaking, a reliable source of time is a fundamental requirement for safety-critical applications, and even more so in control infrastructures, where most of the actions are time based. For this class of applications, synchronization protocols such as the IEEE 1588 standard, also known as the precision time protocol (PTP) [4], can be adopted. This synchronization protocol was designed to achieve very precise clock synchronization in local area networks [6], but it may not be sufficient to ensure the safety of the whole system.
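To make this concern concrete, the following minimal sketch shows the offset computation of the standard PTP delay request-response exchange under the usual symmetric-path assumption. It is only an illustration and is not part of the protocol proposed in this paper; the names `PtpExchange` and `offset_and_delay` and all numeric values (in microseconds) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class PtpExchange:
    """Four timestamps of one delay request-response exchange (microseconds)."""
    t1: float  # master sends Sync (read on the master clock)
    t2: float  # slave receives Sync (read on the slave clock)
    t3: float  # slave sends Delay_Req (read on the slave clock)
    t4: float  # master receives Delay_Req (read on the master clock)


def offset_and_delay(x: PtpExchange) -> tuple[float, float]:
    """Return (slave offset from master, one-way delay), assuming a symmetric path."""
    master_to_slave = x.t2 - x.t1
    slave_to_master = x.t4 - x.t3
    offset = (master_to_slave - slave_to_master) / 2
    delay = (master_to_slave + slave_to_master) / 2
    return offset, delay


# Healthy master: the slave runs 50 us ahead and the one-way delay is 100 us.
healthy = PtpExchange(t1=1000, t2=1150, t3=2000, t4=2050)
print(offset_and_delay(healthy))  # (50.0, 100.0)

# Hypothetical faulty master whose clock reads 300 us too high:
# only the master-side timestamps t1 and t4 are shifted.
faulty = PtpExchange(t1=1300, t2=1150, t3=2000, t4=2350)
print(offset_and_delay(faulty))  # (-250.0, 100.0)
```

In the second case the estimated delay is unchanged and the whole master error appears as slave offset, so a slave that corrects its clock by this value simply follows the faulty master. The exchange alone gives the slave no means of separating a master error from its own offset, which is why additional information on the quality of synchronization is needed.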
0018-9456/$31.00 © 2012 IEEE

Several research studies propose to improve the reliability of the IEEE 1588 standard by modifying the strict master–slave architecture [7], [8]. Other studies propose the integration of IEEE 1588 with network redundancy protocols, such as the rapid spanning tree protocol (RSTP) or high-availability seamless redundancy (HSR) [9], to improve the robustness of the synchronization infrastructure [10]. However, several problems are still open. In fact, these solutions work properly when the reference node stops sending synchronization messages because of a failure of the network or of the node itself. At present, no tool is available to identify anomalies of the reference node that affect the quality of synchronization of all nodes synchronizing with it [11]. The capability of a system to identify anomalies in the time reference node is fundamental, e.g., in real-time and adaptive