An Approach to the Synchronization of Backup Masters in Dynamic Master-Slave Systems

Ernesto Martins, Joaquim Ferreira, Luís Almeida, Paulo Pedreiras, José A. Fonseca
DET – IEETA, Universidade de Aveiro
Aveiro, Portugal
evm@det.ua.pt, jjcf@alunos.det.ua.pt, lda@det.ua.pt, pedreiras@alunos.det.ua.pt, jaf@det.ua.pt

Abstract

This paper considers the case in which master-slave fieldbus networks are used in safety-critical embedded applications, such as transportation systems. Communication in these networks is controlled by the master, which holds a cyclic traffic dispatching table. Master replication is used to achieve fault-tolerance. Traditional approaches to system design, partly for fault-tolerance reasons, have considered only static tables. However, there is a growing demand for flexibility, mainly to improve the efficiency with which system resources are used. This calls for replacing such static tables with dynamic tables that contain the current communication requirements, and for on-line traffic scheduling. This paper considers such dynamic master-slave architectures and addresses the problem of synchronizing the active and backup masters. In particular, the master node uses a scheduling co-processor to speed up on-line traffic scheduling and schedulability analysis, as well as to achieve synchronization in a short period of time.

1. Introduction

Many safety-critical embedded systems in use today, e.g. in transportation systems, are distributed and rely on a fieldbus network that interconnects sensors, actuators and controllers in a reliable and timely way. One popular network access control paradigm used in many of these applications is the master-slave paradigm, in which a single node controls the traffic on the bus using a cyclic traffic dispatching table.
Several examples can be pointed out, such as WorldFIP [5] and TCN [6], widely used in train control systems, as well as MIL-STD-1553B [4], which has long been used in the USA for distributed hard real-time applications. Master replication is, of course, essential to avoid a single point-of-failure and achieve fault-tolerance.

Master-slave networks are naturally synchronized with the master, since this node explicitly tells each slave when to transmit. When it comes to increasing operational flexibility by allowing on-line changes to traffic parameters, the main advantage of this type of network is that such parameters are concentrated in the master node only. This simplifies the admission control of change requests and shortens the reaction time compared with distributed alternatives. With this type of flexibility, one can turn the transmission of message streams on and off, or vary their transmission rates, according to the run-time needs of the system. This results in more efficient network utilization, freeing bandwidth that can be used to serve more streams or to facilitate error recovery.

However, when the master node is replicated, the fact that table contents can vary on-line makes it harder to ensure synchronization between active and backup masters. This paper presents a synchronization mechanism that improves on previous work [7] by taking advantage of a scheduling co-processor (MESSAgE) [2] that assists the master nodes. The mechanism is applied to Controller Area Network (CAN) using FTT-CAN [9].

2. Problem statement

Master-slave protocols are usually regarded as exhibiting a single point-of-failure. This, of course, is no longer true when the master is replicated so that, upon failure of the active master, a backup enters into action within a sufficiently short interval.
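The takeover condition just described can be illustrated with a minimal sketch. It assumes, purely for illustration, that a backup detects the failure of the active master by the absence of the master's periodic trigger message on the bus. The class name BackupMaster, the field names and the tolerance of 1.5 cycles are hypothetical and are not taken from the mechanism presented in this paper.

```python
from dataclasses import dataclass


@dataclass
class BackupMaster:
    """Hypothetical backup master that listens for the active master's triggers."""

    cycle_ms: float               # nominal interval between trigger messages
    timeout_cycles: float = 1.5   # assumed tolerance before taking over
    last_trigger_ms: float = 0.0  # time at which the last trigger was observed
    active: bool = False          # True once this node has taken over

    def on_trigger(self, now_ms: float) -> None:
        """Record a trigger message seen on the bus from the active master."""
        self.last_trigger_ms = now_ms

    def poll(self, now_ms: float) -> bool:
        """Take over if the active master has been silent for too long."""
        silent_for = now_ms - self.last_trigger_ms
        if not self.active and silent_for > self.cycle_ms * self.timeout_cycles:
            self.active = True
        return self.active


backup = BackupMaster(cycle_ms=10.0)
backup.on_trigger(10.0)
took_over_early = backup.poll(20.0)   # 10 ms of silence, within tolerance
took_over_late = backup.poll(26.0)    # 16 ms of silence, beyond tolerance
```

In this sketch the takeover delay is bounded by the chosen tolerance, which is what "a sufficiently short interval" would have to constrain in a real design.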
It is necessary, however, that the masters fail silently and that they are synchronized with respect to the scanning of the traffic dispatching tables, which are normally organized as a succession of so-called micro-cycles.

In our work, the static traffic dispatching table is replaced by a dynamic table containing the communication requirements, i.e. properties of the message streams such as period, phasing and transmission time. This table is then scanned on-line by a traffic scheduler. In this case, there is no longer the concept of a cycle count relative to a fixed reference such as the top of a dispatch table. Thus, other mechanisms must be used to enforce synchronization between active and backup masters.

Furthermore, there is now a problem of coherency between the multiple instances of the dynamic table. There may be change requests that, due to an asynchronous start or restart of a master or to omission communication faults, are accepted by the active master but not by one or more backups, or vice-versa. When this happens, it is important to reestablish coherency and synchronization among all masters as fast as possible. This is the specific problem addressed in this paper.

Further work is being carried out at the level of the protocol between the client that issues the change request and the masters, i.e. the servers. In particular, it is important to deal with the incoherencies that might arise from omissions in the change requests so that such situations are detected