Distributed Error Confinement
Extended Abstract

Yossi Azar
Dept. of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel
azar@cs.tau.ac.il

Shay Kutten
Dept. of Industrial Engineering, Technion, Haifa 32000, Israel
kutten@ie.technion.ac.il

Boaz Patt-Shamir
HP Cambridge Research Lab, One Cambridge Center, Cambridge MA 02142, USA
Boaz.PattShamir@HP.com

ABSTRACT

We initiate the study of error confinement in distributed applications, where the goal is that only nodes that were directly hit by a fault may deviate from their correct external behavior, and only temporarily. The external behavior of all other nodes must remain impeccable, even though their internal state may be affected. Error confinement is impossible if an adversary is allowed to inflict arbitrary transient faults on the system, since the faults might completely wipe out input values. We introduce a new fault tolerance measure we call agility, which quantifies the strength of an algorithm that disseminates information against state-corrupting faults. We study the basic problem of broadcast, and propose algorithms that guarantee error confinement with optimal agility to within a constant factor, even in asynchronous networks when the topology is unknown. These algorithms can serve as building blocks in more general reactive systems. Previous results in exploring locality in reactive systems were not error confined, and relied on the assumption (not used in the current paper) that the errors hitting each node are probabilistic, so that a faulty node itself, or its neighbor, can detect that the node is faulty. The main algorithm uses the novel core bootstrapping technique, which seems inherent for voting in reactive networks; its analysis leads to an interesting combinatorial problem. The technique and the analysis may be of independent interest.

1. INTRODUCTION

One key difference between centralized and distributed systems is that in distributed systems, faults may hit only a part of the system.
To achieve error confinement, we would like to benefit from the fact that many nodes may not have been hit. This intuition was explored before for the case of non-reactive systems (but error confinement, defined below, was not achieved). This becomes harder in distributed systems that are required to propagate information, e.g., communicating new input of one node to remote nodes. The difficulty stems from the fact that the propagation may amplify the effect of a fault by spreading wrong information across the system. This, in effect, causes the receiving nodes to become faulty too. That is, their output and messages differ from those in the case where no faults hit the system. This phenomenon works against the attempt to benefit from the fact that many of the nodes may not have been hit by the original faults. The main technical contribution of this paper, in Subsections 3.3 and 3.4, is an algorithm that prevents this amplification effect.

On leave from Department of Electrical Engineering, Tel Aviv University, Tel Aviv 69978, Israel.
PODC’03, July 17–99, 2003, Boston, Massachusetts, USA.

Previous papers avoided this difficulty, and addressed issues that must still be faced even when this difficulty does not exist. For example, in [2] it was assumed that every faulty node can detect that it is faulty, since the faults were also assumed to be probabilistic in nature. Thus, a faulty node knows not to spread the faults. The task of that paper was then to recover from the faults fast, if the number of faults was small. (Error confinement, defined below, was not addressed, but is immediate in that model.) A similar principle is used in [26] (there the fault can be detected by a neighbor). Similarly, [16] avoided the spreading issue by addressing only the situation in which propagation is not needed; it handled the problems that remain when the information has already somehow been spread correctly, and only then faults occurred.
Thus, again, the nodes corrupted by faults were not required to spread their values (and the faults). The remaining task was just to hold a "special" kind of consensus in which each non-faulty node voted for the already-spread value, while the faulty nodes may have voted for another. That consensus was "special" in the sense that it was in a self-stabilizing model, and in the sense that it was required to be fast when the number of faults was small.

Some fault-resilient protocols deal with faults by allowing arbitrary behavior until recovery is complete (intuitively, declaring a temporary "state of emergency"). In papers such as [16, 2, 17, 13] the question is how to shorten this period (if the number of faults is small). The problem addressed in this paper is how to devise systems that keep the faulty information masked from the external user as much as possible, even during the recovery period.

Let us be more specific (see Section 2 for formal definitions). We consider the model of a distributed system that executes some reactive task, i.e., the environment inputs values at nodes and reads outputs from nodes. The requirement is specified as a predicate over the sequences of input and output values. We consider transient faults, i.e., faults that eventually leave the system. This is modeled by assuming that a fault may hit a set of nodes by arbitrarily modi-