Membership and System Diagnosis

Matti A. Hiltunen
Department of Computer Science
University of Arizona
Tucson, AZ 85721, USA

Abstract

A membership service is a service in a distributed system that maintains and provides information about which sites are functioning and which have failed at any given time. System diagnosis, on the other hand, is a method for detecting faulty processing elements and distributing this information to non-faulty elements. In spite of the apparent similarity of goals, these two fields have been considered separately from their beginnings. In this paper, we attempt to compare these fields and show the fundamental differences and the similarities. We demonstrate that the problems are closely related, with the major differences being the assumptions made about the failure model, the testing methods, and the type of service guarantees provided to the application. Furthermore, we demonstrate that the fields are closely enough related that some algorithms utilized in one field can easily be transformed into algorithms in the other. As examples, we derive new membership algorithms from a distributed system diagnosis algorithm and new system diagnosis algorithms from membership algorithms.

1 Introduction

For any computing system consisting of two or more processors, keeping track of which processors are functioning correctly and which ones have failed is of fundamental importance. The problem becomes especially important with multiprocessors consisting of hundreds, maybe thousands, of processors, or with distributed systems consisting of tens, hundreds, or thousands of computers separated by possibly long physical distances. This problem was first tackled in [37], which initiated the field of system diagnosis. That paper stated that a system operating in a tightly or loosely coupled distributed environment must avoid giving tasks to or using results from faulty processing elements.
Therefore, it is necessary for a central authority, or for every processing element, to be aware of the condition of all the active processing elements. This ability to agree on the state of the system allows the fault-free processors to make correct and consistent progress. A model was presented where each subunit is able to test other subunits. Each test involves the controlled application of stimuli and the observation of the corresponding response. On the basis of the responses, the outcome of the test is classified as “pass” or “fail”. In either case, the testing unit evaluates the tested unit as either fault-free or faulty. Numerous papers on system diagnosis have followed; examples of recent work can be found in [4, 5, 6, 7, 12, 14, 30, 36, 45].

This work supported in part by the Office of Naval Research under grant N00014-91-J-1015.

The group membership problem is the problem of keeping track of which processes of a distributed computation are functioning and which have failed at any given time. The underlying principle of treating a group of processes as a single entity in order to provide fault-tolerant services was introduced as part of the state machine approach [42], and a number of distributed systems like the Isis system [8] provide software support for its implementation. Like diagnosis, membership has been extensively studied and a number of papers have been published proposing different algorithms for solving the problem [2, 3, 17, 20, 21, 26, 31, 34, 38, 40, 41, 44].

Despite the above efforts, little has been done to compare or contrast these two fields. An important exception is [4], which reviewed the field of system diagnosis thoroughly and compared it to membership. Unfortunately, the emphasis was heavily on system diagnosis and the comparison was brief. A number of other papers, for example [19, 26], acknowledge the relationship between these problems but do not explore it any further.
The purpose of this paper is to contrast these two problems and show that they can be viewed as essentially the same problem with slightly different assumptions. Given this observation, we conclude that the choice of service to keep track of functional and faulty processes, processors, or computers (whether called membership or system diagnosis) should be based only on user requirements and assumptions made about the computing environment, especially the failure models and synchrony assumptions. Finally, the observations are applied by transforming two typical distributed system diagnosis algorithms into new membership algorithms, and by transforming a family of membership algorithms into system diagnosis algorithms that provide a service stronger than any other of which we are aware.

2 Background

2.1 Membership Problem

The membership algorithms in the literature can be classified by what assumptions they make about the computing environment, what type of service they provide, and what type of algorithm they employ in order to implement the properties of the service. Perhaps the most important distinction is whether the underlying communication is synchronous or asynchronous. A number of membership algorithms have been designed for synchronous systems where bounds are placed on network transmission time [17, 20, 26, 31, 44], and for asynchronous systems where