RepFD - Using reputation systems to detect failures in large dynamic networks

Maxime Véron*, Olivier Marin*†, Sébastien Monnet*, Pierre Sens*

* Sorbonne Universités, UPMC Univ Paris 06, CNRS, INRIA, LIP6 UMR 7606, 4 place Jussieu 75005 Paris. {Firstname.Lastname}@lip6.fr
† NYU Shanghai. ogm2@nyu.edu

Abstract: Failure detection is a crucial service for dependable distributed systems. Traditional failure detector implementations usually target homogeneous and static configurations, as their performance relies heavily on the connectivity of each network node. In this paper we propose a new approach towards the implementation of failure detectors for large and dynamic networks: we study reputation systems as a means to detect failures. The reputation mechanism allows efficient node cooperation via the sharing of views about other nodes. Our experimental results show that a simple prototype of a reputation-based detection service performs better than other known adaptive failure detectors, with improved flexibility. It can thus be used in a dynamic environment with a large and variable number of nodes.

Keywords: Failure detection; Reputation systems; Large scale distributed systems

I. INTRODUCTION

Distributed systems should provide reliable and continuous services despite the failures of some of their components. A classical way for a distributed system to tolerate failures is to detect them and then to recover. It is now well recognized that the dominant factor in system unavailability lies in the failure detection phase [1]. As a consequence, failure detection plays a central role in the engineering of such systems.

Chandra and Toueg introduced in [2] the notion of unreliable failure detector (FD). An FD is an oracle which provides information about process crashes. It is unreliable in that it can make mistakes for a while: for instance, it may wrongly consider some live nodes as having crashed.
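To make the notion of unreliability concrete, the following minimal sketch shows a timeout-based detector that wrongly suspects a live node and later corrects its mistake. It is only an illustration: the class name `HeartbeatDetector`, the fixed `timeout` value and the node identifiers are assumptions made for this example, not part of the detector proposed in this paper.

```python
class HeartbeatDetector:
    """Timeout-based unreliable failure detector (illustrative sketch only)."""

    def __init__(self, timeout):
        self.timeout = timeout   # assumed fixed timeout, in seconds
        self.last_seen = {}      # node id -> arrival time of its last heartbeat

    def on_heartbeat(self, node, now):
        # Record the most recent heartbeat; this also clears a past suspicion.
        self.last_seen[node] = now

    def suspects(self, now):
        # A node is suspected when no heartbeat arrived within the timeout.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


fd = HeartbeatDetector(timeout=2.0)
fd.on_heartbeat("a", now=0.0)
fd.on_heartbeat("b", now=0.0)
fd.on_heartbeat("a", now=1.0)

# "b" is alive, but its next heartbeat is delayed: the oracle makes a mistake.
wrongly_suspected = fd.suspects(now=2.5)

# The late heartbeat finally arrives and the mistake is corrected.
fd.on_heartbeat("b", now=2.6)
corrected = fd.suspects(now=2.7)
```

Here `wrongly_suspected` contains only `"b"`, while `corrected` is empty: the oracle was temporarily wrong about a live node, which is exactly the kind of mistake an unreliable FD is allowed to make.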
FDs are used in a wide variety of settings, such as network communication and group membership protocols, computer cluster management and distributed storage systems. Numerous implementations of FDs have been proposed, where each node monitors the state of the others. However, most FD implementations have two severe limitations: (i) they treat all nodes in the same way, with no distinction between well-behaved and badly behaved nodes; (ii) local oracles gather information from the other nodes without any coordination [3], [4], [5].

In stable and homogeneous configurations such as clusters, where nodes of the same type are linked through low-latency networks and subject to crash failures at the same rate, these limitations have a low impact on the quality of the failure detection. However, in large and dynamic systems such as gaming platforms or large cloud infrastructures, nodes are very different: some nodes (e.g., servers) are powerful and connected to the network through a high-speed link, whereas others have limited power and slow connections. Taking such differences into account is essential for the quality of the detection. Furthermore, in such dynamic environments, sharing information on the state of the nodes could greatly improve the global view of the distributed system. If one node has a good connection to the other ones, it can share its view with poorly connected nodes and thus prevent wrong views about failures.

In this paper, we propose a new collaborative failure detector which exploits information about the behavior of nodes to increase its detection quality both in terms of detection time (completeness) and mistake avoidance (accuracy). To classify the behavior of nodes we rely on a reputation service where nodes periodically exchange heartbeat messages. The reputation of a node dynamically increases if it sends its heartbeat on time, and decreases if some heartbeats get lost or arrive after the expected dates.
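The heartbeat-driven reputation update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual RepFD update rule: the integer score range, the `reward`, `penalty` and `threshold` values, and all names are hypothetical choices made for this example.

```python
class ReputationTable:
    """Per-node reputation driven by heartbeat timeliness (hypothetical sketch)."""

    def __init__(self, initial=5, reward=1, penalty=2, threshold=2, max_score=10):
        # All numeric values are assumptions chosen for illustration only.
        self.initial = initial
        self.reward = reward
        self.penalty = penalty
        self.threshold = threshold
        self.max_score = max_score
        self.scores = {}  # node id -> current reputation

    def on_time(self, node):
        # A heartbeat received on time increases the reputation (capped).
        score = self.scores.get(node, self.initial)
        self.scores[node] = min(self.max_score, score + self.reward)

    def late_or_lost(self, node):
        # A late or missing heartbeat decreases the reputation (floored at 0).
        score = self.scores.get(node, self.initial)
        self.scores[node] = max(0, score - self.penalty)

    def suspected(self, node):
        # A node is suspected once its reputation drops below the threshold.
        return self.scores.get(node, self.initial) < self.threshold


table = ReputationTable()
table.on_time("n1")            # reputation: 6
table.late_or_lost("n1")       # reputation: 4
table.late_or_lost("n1")       # reputation: 2
after_two_misses = table.suspected("n1")
table.late_or_lost("n1")       # reputation: 0
after_three_misses = table.suspected("n1")
```

With these assumed parameters, a single missed heartbeat does not trigger a suspicion: the node is suspected only after repeated misses, and a penalty larger than the reward means that recovering a good reputation takes longer than losing it.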
We conducted an extensive evaluation of our failure detection on distributed configurations using real traces to inject failures and message losses. We show that our detector outperforms well-known implementations [4], [6]: it provides better accuracy while keeping short detection times, especially when the network is subject to message losses.

The rest of the paper is organized as follows. Section II presents the reputation system we use to implement our failure detection service and details the detector implementation. Section III describes two standard failure detector implementations, which we then compare to our solution in the performance evaluation of Section IV. Finally, Section V explores related work and Section VI concludes the paper.

II. DETECTING FAILURES WITH A REPUTATION SYSTEM

Our solution uses a distributed reputation system to detect failures. A reputation system [7] aims to collect and compute feedback about node behaviors. Feedback is subjective and obtained from past interactions between nodes, yet gathering feedback about all the interactions associated with one node produces a rather accurate representation of its behavior. In our case, the reputation system focuses on behaviors that fall within the scope of a given failure model.

The reputation system we present in this section is basic and aims to reproduce the qualities of a good reputation system according to [8]: fast convergence, precise rating of nodes, resistance to malicious nodes, small overhead, scalability, and adaptivity to peer dynamics. Our reputation system can be used for a wide variety of middleware and services. In previous work, we describe this reputation system in detail and use it to