A Scalable and Efficient Self-Organizing Failure Detector for Grid Applications

Yuuki Horita, University of Tokyo, horita@logos.ic.i.u-tokyo.ac.jp
Kenjiro Taura, University of Tokyo / JST, tau@logos.ic.i.u-tokyo.ac.jp
Takashi Chikayama, University of Tokyo, chikayama@logos.ic.i.u-tokyo.ac.jp

Abstract— Failure detection and group membership management are basic building blocks for self-repairing systems in distributed environments; in practice they need to be scalable, reliable, and efficient. As available resources grow larger in number and more widely distributed, it becomes increasingly important that these building blocks can be deployed with little manual configuration in Grid environments, where connectivity between different networks may be limited by firewalls and NATs. In this paper, we present a scalable failure detection protocol that self-organizes in Grid environments. Our failure detectors autonomously create dispersed monitoring relationships among participating processes with almost no manual configuration, so that each process is monitored by a small number of other processes, and quickly disseminate notifications along the monitoring relationships when failures are detected. Through simulations and real experiments, we show that our failure detector achieves practical scalability, high reliability, and good efficiency. The overhead with 313 processes was at most 2 percent even when the heartbeat interval was set to 0.1 second, and correspondingly smaller when it was longer.

I. INTRODUCTION

Failure detection and group membership management are basic components for supporting distributed applications that can autonomously recover from faults (crashes) of participating processes. With available Grid resources becoming larger in size and more widely distributed, efficient and scalable systems for failure detection and group membership management are becoming more important.
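As a minimal illustration of the monitoring structure outlined above — each process monitored by a small, randomly chosen set of other processes, with failure notifications disseminated by flooding along the monitoring links — consider the following Python sketch. All names here are hypothetical; the actual protocol builds these relationships autonomously at runtime and monitors over TCP connections with heartbeats, which this sketch does not model.

```python
import random

def assign_monitors(processes, k, seed=None):
    """For each process, pick k distinct monitors among the other
    processes uniformly at random (centralized stand-in for the
    paper's self-organizing construction)."""
    rng = random.Random(seed)
    monitors = {}
    for p in processes:
        others = [q for q in processes if q != p]
        monitors[p] = rng.sample(others, min(k, len(others)))
    return monitors

def flood_notification(edges, origin):
    """Disseminate a failure notification about `origin` by simple
    flooding along established (undirected) monitoring links.
    Returns the set of processes the notification reaches."""
    notified = {origin}
    frontier = [origin]
    while frontier:
        node = frontier.pop()
        for neighbor in edges.get(node, ()):
            if neighbor not in notified:
                notified.add(neighbor)
                frontier.append(neighbor)
    return notified
```

With k around 4 or 5, the resulting random monitoring graph is, with high probability, connected and of low diameter, which is what makes flooding along the existing links both fast and cheap.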
In today’s practice, such systems are used both in parallel programming libraries such as PVM [1] and MPI [2], [3] and in resource monitoring services [4], [5]. Desirable features of such systems include low overhead, (semi-)automatic self-organization, absence of a single point of failure, scalability, and detection accuracy and speed. Many existing systems either use simple all-to-one or all-to-all heartbeating schemes that lack scalability [2], [3] or require complex manual configuration [4], [5].

Our system, described herein, builds on recent advances in scalable failure detectors and group membership protocols [6], [7] and addresses several issues of practical importance. Algorithms described in this body of literature rarely come with an evaluation of a real implementation, and often make simplifying assumptions that become problematic in practice. For example, they typically assume that messages (connections) are never blocked between live nodes. This does not hold in the presence of firewalls and NAT routers, both of which are common in real Grid environments. As another example, they often ignore the fact that, in practice, sending a message to a node for the first time generally involves establishing a TCP connection, which is much more costly than sending subsequent messages over an established connection. TCP is necessary, again, because UDP is more often blocked between administrative domains than TCP is; see the discussion in Section IV for other reasons. Practical systems therefore need to pay attention not only to the total traffic per unit time, but also to how many nodes each node ever sends a message to. Systems based on simple gossiping fall short in this respect [8].

This paper presents a scheme that overcomes these issues and reports both simulation and real experimental results. Our system can be concisely described as follows.
• It follows the basic idea of [6] for failure detection, in which each node is monitored by a small number of randomly chosen nodes (typically 4 or 5).
• A node is monitored via TCP connections and heartbeats sent over them. A process crash is detected by a connection reset, and a machine crash by the absence of heartbeats.
• Once process or machine faults are detected, notifications are quickly disseminated along the established TCP connections by simple flooding.
• It works with almost no manual configuration about network connectivity (firewalls and NATs).

Overall, our system will be useful as a library for supporting parallel applications that tolerate process/machine crashes and/or use a dynamically changing set of resources [9]. This is the primary intended purpose of this work. We believe, however, that it will also be useful as a basic building block for flexible and easy-to-use resource monitoring services.

The remainder of this paper is organized as follows. Section II discusses what a failure detector should provide to support fault-tolerant distributed applications, and Section III reviews existing techniques. We propose our basic protocol for local-area networks in Section IV and extend it to wide-area networks in Section V. We present simulation and experimental results in Section VI. Finally, Section VII presents a summary and future work.

II. BACKGROUND

We consider processes communicating with each other via point-to-point asynchronous messages. We do not assume that the underlying communication layer supports multicast. Each

Grid Computing Workshop 2005 202 0-7803-9493-3/05/$20.00 2005 IEEE