Modeling the reliability of a group membership protocol for dual-scheduled time division multiple access networks ☆
Valério Rosset a,1, Pedro F. Souto b,⁎, Paulo Portugal a, Francisco Vasques c
a Departmento de Engenharia Electrotécnica e de Computadores, Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias s/n, 4200-465 Porto, Portugal
b Departmento de Engenharia Informática, Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias s/n, 4200-465 Porto, Portugal
c Departmento de Engenharia Mecânica, Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias s/n, 4200-465 Porto, Portugal
Article history: Received 27 May 2009; accepted 12 October 2011; available online 20 October 2011
Keywords: Distributed computing; Reliability; Fault-tolerance; FlexRay; Group membership
Abstract
We present reliability models for a group membership protocol (GMP) designed for TDMA networks such as FlexRay, a protocol that is likely to become the de facto standard for next-generation automotive networks. The models are based on discrete-time Markov chains and consider a comprehensive set of fault scenarios. Furthermore, they are parametric, allowing for a sensitivity analysis. The results, obtained by numerically solving the models with the PRISM model checker, show that the models are computationally practical for realistic configurations and that the GMP can achieve reliability levels in the range required for safety-critical applications.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction
Fieldbus-based communication systems play a prominent role in several application domains, ranging from manufacturing and process industries to automotive and avionics systems. They form the backbone of modern networked control systems, providing a communication infrastructure that supports control, supervision, monitoring and diagnosis applications [1].
Many of these application domains impose stringent dependability requirements, since a failure of the control system may have serious consequences for the system under control or its environment, such as economic losses or severe human injury. Typical examples are safety-critical applications for automotive x-by-wire systems [2] and industrial machinery [3]. Therefore, these systems must include fault-tolerance mechanisms to prevent faults from causing a system failure.
Recently, a new class of time division multiple access (TDMA) control protocols, which we call dual-scheduled TDMA (DuST), has emerged as an important solution for providing fault-tolerant communications to safety-critical applications. Examples [4] of these solutions include TT-CAN [5], FTT-CAN [6,7] and, most importantly, FlexRay [8], which is expected to become the de facto standard for next-generation automotive networks.
It is widely accepted [9] that services such as group membership and distributed agreement make the systematic development of safety-critical applications easier. Because these services are used as building blocks of safety-critical applications, it is important to ensure their correctness, ideally through a formal proof. However, any proof relies on fault assumptions, and it guarantees that the service behaves correctly only as long as those assumptions hold.
Given that applications in the automotive domain are mission-oriented, reliability is the most important dependability attribute of these services. In this paper, we present models for evaluating the reliability of a group membership protocol (GMP) [10] by equating its reliability with the reliability of the assumptions made in its proof, i.e., the probability that these assumptions hold.
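As a rough illustration of how a discrete-time Markov chain can turn "the probability that the proof's fault assumptions hold" into a reliability figure, the Python sketch below computes the probability of never reaching an absorbing "assumptions violated" state within a given number of steps. It is not the authors' model and it does not use PRISM. The state space (number of permanently failed nodes), the hypothetical tolerance bound `max_failed`, the per-execution permanent-fault probability `p_perm`, and the per-execution probability `q_viol` that a transient fault breaks the assumptions are all assumptions made for this example, with one chain step standing for one protocol execution.

```python
from math import comb


def transition_matrix(n_nodes, max_failed, p_perm, q_viol):
    """Build the transition matrix of a toy reliability DTMC.

    Hypothetical model (not the paper's): states 0..max_failed count the
    permanently failed nodes while the fault assumptions still hold; the
    extra last state is an absorbing "assumptions violated" state.
    One step stands for one protocol execution.
    """
    if not 0 <= max_failed <= n_nodes:
        raise ValueError("need 0 <= max_failed <= n_nodes")
    violated = max_failed + 1
    size = max_failed + 2
    P = [[0.0] * size for _ in range(size)]
    for k in range(max_failed + 1):
        alive = n_nodes - k
        # Assumed: each still-working node fails permanently with probability
        # p_perm, independently, so new failures ~ Binomial(alive, p_perm).
        for new in range(alive + 1):
            pr = comb(alive, new) * p_perm**new * (1 - p_perm) ** (alive - new)
            if k + new <= max_failed:
                # Assumed: an independent transient fault breaks the
                # per-execution assumptions with probability q_viol.
                P[k][k + new] += pr * (1 - q_viol)
                P[k][violated] += pr * q_viol
            else:
                # Too many permanent failures: the assumptions no longer hold.
                P[k][violated] += pr
    P[violated][violated] = 1.0  # absorbing state
    return P


def reliability(P, steps, start=0):
    """Probability of not being in the violated (last) state after `steps` steps."""
    size = len(P)
    dist = [0.0] * size
    dist[start] = 1.0
    for _ in range(steps):
        # One DTMC step: dist_{t+1}[j] = sum_i dist_t[i] * P[i][j]
        dist = [sum(dist[i] * P[i][j] for i in range(size)) for j in range(size)]
    return 1.0 - dist[-1]


if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the paper.
    P = transition_matrix(n_nodes=4, max_failed=1, p_perm=1e-6, q_viol=1e-9)
    for n in (1_000, 10_000):
        print(n, reliability(P, n))
```

Because every parameter is an explicit argument, a simple sensitivity analysis amounts to re-evaluating `reliability` over a range of values, for instance sweeping `p_perm` while keeping the other parameters fixed.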
Computer Standards & Interfaces 34 (2012) 281–291
doi:10.1016/j.csi.2011.10.004
☆ This work was partially supported by the Portuguese Fundação para a Ciência under Project PTDC/EIA74313/2006 and by Scholarship BD 19302/2004.
⁎ Corresponding author. Tel.: +351 22 5081855; fax: +351 22 557 4103.
E-mail addresses: vrosset@unifesp.br (V. Rosset), pfs@fe.up.pt (P.F. Souto), pportugal@fe.up.pt (P. Portugal), vasques@fe.up.pt (F. Vasques).
1 This author's current affiliation is Instituto de Ciência e Tecnologia, Universidade Federal de São Paulo (UNIFESP), Rua Talim, 330 - Vila Nair, São José dos Campos, SP, Brasil.
Because the GMP is executed repeatedly and its fault assumptions are stated on a per-execution basis, we present discrete-time Markov chain (DTMC) models whose time step is closely related to one GMP execution. The fault models considered are rather comprehensive and include both permanent and transient faults affecting both nodes and communication channels. In addition to standard message loss, which affects a single message and is perceived identically by all good nodes, we consider two other types of faults: 1) strictly omission asymmetric communication faults [11], in which some good nodes