Application-Level Diagnostic and Membership Protocols for Generic Time-Triggered Systems

Marco Serafini, Péter Bokor, Neeraj Suri (TU Darmstadt, Germany), Jonny Vinter (SP, Sweden), Astrit Ademaj (TU Wien, Austria), Wolfgang Brandstätter (Audi, Germany), Fulvio Tagliabò (Centro Ricerche Fiat, Italy), Jens Koch (Airbus, Germany)

Abstract—We present on-line tunable diagnostic and membership protocols for generic time-triggered (TT) systems to detect crashes, send/receive omission faults and network partitions. Compared to existing diagnostic and membership protocols for TT systems, our protocols do not rely on the single-fault assumption and also tolerate non-fail-silent (Byzantine) faults. They run at the application level and can be added on top of any TT system (possibly as a middleware component) without requiring modifications at the system level. The information on detected faults is accumulated using a penalty/reward algorithm to handle transient faults. After a fault is detected, the likelihood of node isolation can be adapted to different system configurations, including configurations where functions with different criticality levels are integrated. All protocols are formally verified using model checking. Using actual automotive and aerospace parameters, we also experimentally demonstrate the transient fault handling capabilities of the protocols.

Index Terms—Diagnosis, Membership, Time-Triggered Systems, Transient Faults.

I. INTRODUCTION

In both automotive and aerospace X-by-wire applications, TT platforms such as FlexRay [15], TTP/C [21], SAFEbus [18] and TT-Ethernet [19] are increasingly being adopted. Some TT platforms, such as FlexRay, do not provide distributed diagnostic and membership services, while others, like TTP/C [21] or SAFEbus, rely on their specific system-level properties to implement customized solutions.
Instead, we define on-line diagnostic and membership protocols as add-on application-level modules that can be integrated as plug-in middleware onto any TT system, without (potentially problematic [31]) interference with other functionalities or applications. Our protocols (a) only use network-level error detection information that is made available at the application level by TT platforms, (b) do not impose constraints on the scheduling of the system, and (c) have low bandwidth requirements. The protocols can be tuned to meet customized fault coverage and latency requirements. For TT platforms, such as FlexRay and TT-Ethernet, that do not provide a standardized diagnostic or membership protocol, our protocols represent a viable and flexible way to provide such add-on functionalities.

The key purpose of a diagnostic protocol, in particular if it is used for safety-critical subsystems, is to identify faulty nodes within a small diagnostic delay. Nonetheless, a diagnostic protocol also needs to consider resource availability and to avoid declaring correct components faulty in the case of transient faults, which are becoming increasingly frequent [10]. An “ideal” diagnostic protocol would exclude only nodes with permanent internal faults. In practice, however, these faults do not always manifest as permanent faults at the interface of the node (e.g. crashes). They can also manifest as multiple, subsequent intermittent faults (e.g. sparse message omissions) which, to external observers, appear similar to transient faults.

Our diagnostic protocol uses a penalty/reward (p/r) algorithm to distinguish between transient faults and intermittent or permanent faults with stochastically predictable accuracy and coverage [30]. Predictability is provided by a stochastic model that considers faults not only over a single protocol run but over multiple runs (using an extended fault model).
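The intuition behind a penalty/reward counter can be illustrated with a minimal sketch: detected faults increase a per-node counter by a penalty, fault-free rounds decrease it by a reward, and a node is isolated only when the counter crosses a threshold. The class name, parameter values, and update rule below are hypothetical illustrations of the general p/r idea, not the paper's actual algorithm or its tuned parameters.

```python
# Illustrative penalty/reward counter (hypothetical parameters, not from the paper).
# A transient fault is forgiven as subsequent fault-free rounds erode the counter;
# intermittent or permanent faults accumulate penalties until isolation.
class PenaltyReward:
    def __init__(self, penalty=4, reward=1, threshold=12):
        self.penalty = penalty      # added when a fault is detected in a round
        self.reward = reward        # subtracted after a fault-free round
        self.threshold = threshold  # counter value at which the node is isolated
        self.count = 0

    def update(self, fault_detected):
        """Process one diagnostic round; return True if the node should be isolated."""
        if fault_detected:
            self.count += self.penalty
        else:
            self.count = max(0, self.count - self.reward)
        return self.count >= self.threshold
```

With these illustrative values, a single transient fault (count 4) decays back to 0 after four fault-free rounds, while three faulty rounds in a row (count 12) trigger isolation. Tuning the penalty/reward ratio and threshold trades off diagnostic latency against tolerance of transient faults.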
Unlike existing diagnostic approaches, which rely on system-specific heuristics for handling transient faults (e.g. [18], [34]), and like the α-count model of [6], [7], our generic p/r model can be applied and tuned to each specific implementation in a well-defined manner. However, different from α-count, our FDIR (Fault Detection, Isolation and Reconfiguration) model does not assume that the maximum duration of transient faults is bounded and known. It admits closed-form analytical solutions which can be easily evaluated by hand without using modeling tools, and it considers systems running multiple applications with varied criticalities. In this paper, we show for the first time how to integrate a p/r algorithm with an on-line distributed diagnostic protocol.

Analogous to diagnosis is the membership problem [17], [2], which consists of identifying the set of nodes (called the membership view) that have received the same history of messages. We show that a variant of our protocol can act as a membership service and detect the formation of multiple cliques of receivers with inconsistent information. Similar to diagnosis, membership protocols also need to consider availability. Our membership protocols are the first where the ensured consistency degree can be tuned, using the p/r algorithm, to avoid over-reactions to transient faults. The protocol is also extended to detect and tolerate both permanent and transient network partitioning.

An important aspect of our diagnostic and membership protocols is providing consistent diagnostic and membership information to all nodes even in the presence of worst-case (Byzantine) faults. A common feature of TT systems is that non-fail-silent faults at the network level are turned into fail-silent faults. This ensures that correct nodes can still communicate despite the presence of non-fail-silent faulty nodes.
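The clique-formation problem behind the membership service can be sketched as follows: if each node reports the history of messages it has received, nodes with identical histories form a clique, and more than one clique signals inconsistent information among receivers. The function below is only an illustration of this grouping idea under that assumption; it is not the paper's protocol, which must work distributedly and under Byzantine faults.

```python
# Illustrative sketch (not the paper's protocol): group nodes into cliques
# by the history of messages each node reports having received.
def detect_cliques(reported_histories):
    """reported_histories: dict mapping node id -> tuple of received message ids.

    Returns a list of cliques, each a set of node ids that share the
    same reported history. A single clique means a consistent view;
    multiple cliques indicate inconsistent receivers.
    """
    cliques = {}
    for node, history in reported_histories.items():
        cliques.setdefault(history, set()).add(node)
    return list(cliques.values())
```

For example, if nodes 1-3 report having received messages (10, 20, 30) but node 4 reports only (10, 20), two cliques emerge and the majority clique {1, 2, 3} would be the natural membership view.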
SAFEbus, for example, uses double-redundant Bus Interface Units to detect and isolate non-fail-silent faults [18], whereas TTP/C can use a star network configuration with redundant bus guardians [1]. For this reason, many previous membership protocols for TT systems assume only benign faults (crashes, send and receive omissions) [21], [3], [14]. Although we assume fail-silence at the network level, this does not rule out the presence of non-detected errors at the application level where