4 IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 62, NO. 1, JANUARY 2013
Master Failure Detection Protocol in Internal
Synchronization Environment
Andrea Bondavalli, Member, IEEE, Francesco Brancati,
Alessandra Flammini, Senior Member, IEEE, and Stefano Rinaldi, Member, IEEE
Abstract—During the last decades, advances in networking
technologies have allowed the development of distributed
monitoring and control systems. These systems show advantages
compared with centralized solutions: heterogeneous nodes can be
easily integrated, new nodes can be easily added to the system, and
there is no single point of failure. For these reasons, distributed
systems have been adopted in different fields, such as industrial
automation and telecommunication systems. Recently, due to technology
improvements, distributed systems have also been adopted in the control
of power-grid and transport systems, i.e., the so-called large-scale
complex critical infrastructures. Given the strict safety, security,
reliability, and real-time requirements, using distributed systems
to control such critical infrastructures demands adequate
mechanisms for sharing the same notion of
time among the nodes. For this class of systems, a synchronization
protocol, such as the IEEE 1588 standard, can be adopted. This
type of synchronization protocol was designed to achieve very
precise clock synchronization, but it may not be sufficient to ensure
the safety of the entire system. For example, instability of the local
oscillator of a reference node, due to a failure of the node itself or
to malicious attacks, could influence the quality of synchronization
of all nodes. In recent years, a new software clock, the reliable and
self-aware clock (R&SAClock), which is designed to estimate the
quality of synchronization through statistical analysis, was
developed and tested. This statistical instrument can be used to identify
any anomalous conditions with respect to normal behavior. A
careful analysis and classification of the main points of failure of
the IEEE 1588 standard suggests that the reference node, which is
called the master, is the weak point of the system. For this reason, this
paper deals with the detection of faults of the reference node(s)
of an IEEE 1588 setup. This paper describes and evaluates
the design of a protocol for timing failure detection for internal
synchronization based on a revised version of the R&SAClock
software suitably modified to cross-exploit the information on the
quality of synchronization among all the nodes of the system.
The experimental evaluation of this approach confirms the
capability of the synchronization uncertainty, which is provided by
R&SAClock, to reveal the anomalous behaviors either of the local
node or of the reference node. In fact, it is shown that, through a
proper configuration of the parameters of the protocol, the system
is able to detect all the failures injected on the master in different
experimental conditions and to correctly identify failures on slaves
with a probability of 87%.
Index Terms—Failure detection, precision time protocol (PTP),
reliable and self-aware clock (R&SAClock), synchronization
uncertainty, synchronization.
Manuscript received November 29, 2011; revised March 16, 2012; accepted
March 17, 2012. Date of publication August 15, 2012; date of current version
December 12, 2012. This work was supported in part by the Italian Ministry
for Education, University, and Research in the framework of the Project of
National Research Interest (PRIN) “DOTS-LCCI: Dependable Off-The-Shelf
based middleware systems for Large-scale Complex Critical Infrastructures”
(dotslcci.prin.dis.unina.it) 2008 and of the PRIN “Methods and tools for
the time measurement and for the time synchronization in wireless sensor
networks,” N. 2008TK5B55_003. The Associate Editor coordinating the review
process for this paper was Dr. Dario Petri.
A. Bondavalli and F. Brancati are with the Department of Systems and
Informatics, University of Florence, 50134, Firenze, Italy (e-mail:
bondavalli@unifi.it; francesco.brancati@unifi.it).
A. Flammini and S. Rinaldi are with the Department of Information
Engineering, University of Brescia, 25123 Brescia, Italy (e-mail:
alessandra.flammini@ing.unibs.it; stefano.rinaldi@ing.unibs.it).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIM.2012.2209916
I. INTRODUCTION
THE WIDE diffusion of distributed systems in different
fields, such as industrial automation and telecommunication,
and their increasingly wide use in large-scale complex
critical infrastructures (LCCIs) [1], such as transport or
power grids [2], play a key role in several fundamental
human activities. This type of system requires not only an
accurate synchronization of the nodes in order to assure an
adequate quality of service (QoS) but also suitable
algorithms that are able to take runtime decisions on
the basis of the actual and past behavior of the system in order
to detect relevant anomalies even in systems that exhibit
variable and nonstationary behavior and may be affected by
perturbations [3].
Generally speaking, a reliable source of time is a fundamental
requirement for safety-critical applications, and even more so
in control infrastructures, where most of the actions are time
based. For this class of applications, synchronization protocols
such as the IEEE 1588 standard, also known as the precision
time protocol (PTP) [4], can be adopted. This synchronization
protocol was designed to achieve very precise clock synchronization
in local area networks [6], but it may not be sufficient
to ensure the safety of the whole system. Several research studies
propose to improve the reliability of the IEEE 1588 standard by
modifying the strict master–slave architecture [7], [8]. Other
studies propose the integration of IEEE 1588 with network
redundancy protocols, such as the rapid spanning tree protocol
(RSTP) or high-availability seamless redundancy (HSR) [9], to
improve the robustness of the synchronization infrastructure
[10]. However, several problems are still open. In fact, these
solutions work properly when the reference node stops sending
synchronization messages because of a failure of the network
or of the node itself. At present, no tool is available to
identify anomalies of the reference node that affect the quality
of synchronization of all nodes synchronizing with it [11].
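To make the last point concrete, the following minimal sketch shows the offset and delay computation of the IEEE 1588 delay request-response exchange, under the usual assumption of a symmetric network path. The function name, the timestamp values, and the 3-unit master clock error are illustrative assumptions, not taken from this paper. The sketch shows that an error in the master clock shifts the offset computed by the slave while leaving the measured path delay unchanged, so the slave has no direct indication that its reference is faulty.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Offset and one-way delay from the four PTP timestamps.

    t1: Sync sent (master clock)       t2: Sync received (slave clock)
    t3: Delay_Req sent (slave clock)   t4: Delay_Req received (master clock)
    Assumes a symmetric path delay (hypothetical helper for illustration).
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2.0  # slave clock minus master clock
    delay = ((t2 - t1) + (t4 - t3)) / 2.0   # one-way path delay
    return offset, delay


# Healthy master: slave is 5 units ahead, path delay is 10 units.
healthy = ptp_offset_and_delay(t1=100, t2=115, t3=200, t4=205)

# Faulty master: its clock reads 3 units ahead of true time,
# so both master-side timestamps (t1 and t4) are shifted by 3.
faulty = ptp_offset_and_delay(t1=103, t2=115, t3=200, t4=208)

print(healthy)  # (5.0, 10.0)
print(faulty)   # (2.0, 10.0)
```

In the faulty case the slave would correct its clock by 2 units instead of 5 and would therefore follow the master's error, while the delay estimate gives no hint of the problem.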
The capability of a system to identify anomalies in the time
reference node is fundamental, e.g., in real-time and adaptive