Adaptive and Dependable Group Communication

Raimundo José de Araújo Macêdo
Distributed Systems Laboratory (LaSiD)
Computer Science Department
Federal University of Bahia
Campus de Ondina, Salvador, Bahia, Brazil
macedo@ufba.br

ABSTRACT
Group communication is a powerful abstraction that is widely used to manage consistency problems in a variety of distributed system models, ranging from synchronous to time-free asynchronous models. Though similar in principle, distinct implementation mechanisms have been employed in the design of group communication for distinct system models. However, the hybrid nature of many modern (real-time) distributed systems, with dynamic and varied QoS guarantees, has put forward the need for integrated models. Furthermore, adaptation with degraded service is a common requirement in such scenarios. This paper tackles this new challenge by introducing a generic group communication mechanism called the Timed Causal Blocks. Because of its integrated nature, the Timed Causal Blocks mechanism is capable of handling group communication for both synchronous and asynchronous distributed systems, dynamically adapting to the available QoS. For example, it can dynamically switch to the asynchronous version when the run-time system can no longer guarantee timely operation. Formal properties of the integrated model and related mechanisms are presented, with proof sketches.

Keywords
group communication, fault tolerance, hybrid systems

1. INTRODUCTION
Group communication is a powerful abstraction that can be used whenever groups of distributed processes cooperate for the execution of a given task [3, 13, 1, 7]. In group communication, processes communicate on a group basis: a message is sent to a group of processes. A group is usually associated with a name that application processes refer to, which makes the location of the distributed processes forming the group transparent.
Due to the uncertainties inherent to distributed systems (emerging from communication or process failures), group communication protocols have to face situations where, for instance, a sender process fails while a multicast is underway or where messages arrive in an inconsistent order at different destination processes. Moreover, a consistent view of the set of functioning processes is fundamental for some applications. For instance, when server replication is employed to attain high dependability of services, surviving server replicas must take over the responsibilities of failed ones. Therefore, a central problem that has to be solved in such a context is to keep consistent both the server replica states and the views that distinct replicas have of the set of functioning ones. Group communication is widely used to manage such problems in a variety of distributed system models, ranging from synchronous to time-free asynchronous models.

In synchronous systems, message transmission and process execution delays are bounded and, in most cases, these bounds are assumed to be known in advance. The synchronous model is a natural fit for hard real-time distributed systems, since it guarantees bounded reaction time for events. This model assumption simplifies the treatment of failures, since a process that fails to send a message (or to process it) within the delay bound can be considered to have failed. As a consequence, several problems related to fault-tolerant computing, such as membership, consensus, and atomic broadcast, have been solved in such a model. The price that has to be paid is the real-time scheduling and hardware redundancy techniques that need to be used to guarantee that such a bounded time assumption holds with high probability [17, 6, 8]. In an asynchronous system, on the other hand, there is no known bound for message transmission or processing times.
This makes the system more portable and less sensitive to operational conditions (for example, long unpredictable transmission times will not affect the safety properties of the system). However, it is well known that some fundamental fault-tolerant problems have no deterministic solution in such a model (e.g., the consensus problem [14]).

In practice, however, most systems (especially those built from off-the-shelf components) are neither fully synchronous nor fully asynchronous. Most of the time they behave synchronously, but they can have "unstable" periods during which they behave in an anarchic way. That is why many researchers have successfully identified distinct stability conditions necessary to solve fundamental fault-tolerant problems [4, 12]. Other researchers have considered hybrid systems composed of a synchronous and an asynchronous part. We can therefore regard such systems as being hybrid in the space dimension. This is the case of the TCB model, which relies on a synchronous wormhole to implement fault

Technical Report 001/2008. Distributed Systems Laboratory, UFBA. January 2008. Available on http://www.lasid.ufba.br