FailDetect: Gossip-based Failure Estimator for Large-Scale Dynamic Networks

Andrei Pruteanu, Venkat Iyer, Stefan Dulman
Delft University of Technology, The Netherlands
Emails: {a.s.pruteanu,v.g.iyer,s.o.dulman}@tudelft.nl

Abstract—Ubiquitous, wirelessly connected devices are the present status quo for the networks around us. With their ever-increasing scale comes the problem of transmission failures, which are usually caused by hardware faults, software faults, or medium access contention. In mobile networks, node mobility also introduces path uncertainty. All this leads to low quality of service and a degraded user experience. The main contribution of the paper is a novel distributed algorithm called FailDetect for the statistical estimation of the average packet failure-rate in large-scale wireless distributed systems. It is based on gossip protocols, extended with periodic resets of the exchanged values. It is a fully-distributed scheme that does not presume time synchronization among the reset intervals of the various nodes. A model and an evaluation by means of simulation and experiments show that FailDetect succeeds in estimating the average packet failure-rate of the network while exhibiting low message-complexity.

Index Terms—large-scale systems, failures, packet loss, distributed, gossiping, reset

I. INTRODUCTION AND MOTIVATION

The omnipresence of wirelessly connected devices around us is no longer a prediction about the future, but rather a fact about today's technological status quo. Along with the ubiquitous, always-connected experience comes the problem of transmission failures due to noise in the wireless environment or to various hardware or software problems. Wireless communication environments are inherently different from wired ones. For the majority of devices (those equipped with omnidirectional antennas), every transmission is a broadcast.
Additionally, nodes have to share a limited part of the wireless spectrum. Due to contention and radio propagation issues (multi-path effects [8], etc.), the chances of transmission failures are high [25].

Failure detection is one of the most important building blocks of distributed systems applications such as transactions [12], consensus [5] and replication services [19]. In systems where synchronization is hard to achieve (such as MANETs), a failure detection service may be used to help solve various agreement problems [9].

Inspired by real-world deployments of WSNs, where periodic resets of the nodes are a known failure mode [3], we propose a new failure detection algorithm that incorporates resets into a gossiping algorithm. We show that the new mechanism, called DiffusionReset, retains the property of converging exponentially fast. Although our extension is derived from gossiping algorithms that are sensitive to mass conservation [16], [17], our approach specifically exploits the property that the total mass varies in a dynamic network.

Based on DiffusionReset, we develop the FailDetect algorithm as a solution for the online estimation of the fail rate within the network (defined as the percentage of packets that are lost within a defined period of time). We do not assume that the nodes advertise their packet transmission success rate. In short, random subsets of nodes reset the local values used by the algorithm. The result of the gossiping algorithm is an average aggregate value, available at all nodes. The deviation of this value from the expected estimate (the value obtained when there is no message loss) indicates the amount of transmission failures in the system.

To the best of our knowledge, this is the first work that addresses an arbitrary mobile multihop topology while still offering very good fail rate estimates in a fully distributed manner and with low message complexity.
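The idea that lost messages shift a gossip aggregate away from its loss-free value can be illustrated with a small toy model. The sketch below is not DiffusionReset or FailDetect, whose details are given later in the paper; it is a simplified stand-in under several assumptions of our own: a fully connected network, synchronous rounds, a push rule in which each node keeps half of its value and sends the other half to a random peer, independent resets to 1.0 with a fixed probability, and a closed-form inversion that holds only for this toy model. All function and parameter names are hypothetical.

```python
import random


def gossip_round(values, loss_rate, rng):
    """One synchronous push round (toy model, not the paper's algorithm).

    Every node keeps half of its value and pushes the other half to a
    uniformly chosen other node. A lost push destroys that half, so the
    total mass shrinks when messages are lost.
    """
    n = len(values)
    nxt = [v / 2.0 for v in values]
    for i in range(n):
        j = rng.randrange(n - 1)
        if j >= i:  # skip the sender itself
            j += 1
        if rng.random() >= loss_rate:  # message delivered
            nxt[j] += values[i] / 2.0
    return nxt


def steady_state_mean(n=500, loss_rate=0.0, reset_prob=0.05,
                      burn_in=100, measured=200, seed=1):
    """Time-averaged network mean after a burn-in period."""
    rng = random.Random(seed)
    values = [1.0] * n
    total = 0.0
    for t in range(burn_in + measured):
        values = gossip_round(values, loss_rate, rng)
        # Assumed reset rule: each node independently resets to 1.0.
        values = [1.0 if rng.random() < reset_prob else v for v in values]
        if t >= burn_in:
            total += sum(values) / n
    return total / measured


def estimate_loss_rate(mean, reset_prob):
    """Invert the toy model's fixed point m = r / (1 - (1 - r)(1 - p/2))."""
    p = 2.0 * (1.0 - (1.0 - reset_prob / mean) / (1.0 - reset_prob))
    return min(1.0, max(0.0, p))


if __name__ == "__main__":
    for p in (0.0, 0.1, 0.2):
        m = steady_state_mean(loss_rate=p)
        print(p, round(m, 3), round(estimate_loss_rate(m, 0.05), 3))
```

In this toy model the expected mean after one round is (1 - r)(1 - p/2) times the previous mean plus r, where r is the reset probability and p the loss rate. With no loss the mean settles near 1, and with loss it settles at a lower value from which p can be recovered. The real algorithm has to cope with multihop topologies, mobility and unsynchronized resets, none of which this sketch models.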
We validate our work with a model checked by both simulation and experiments on our wireless testbed. For the analysis of our algorithm via simulation, we considered different mobility and network density scenarios. These scenarios cannot be matched with corresponding traces from real deployments, because such traces are scarce and difficult to collect, especially for large-scale mobile ad-hoc networks.

The paper is structured as follows: in Section II we describe the existing state of the art. In Section III we introduce the failure detection mechanism. We analyze the proposed FailDetect algorithm in Section IV, and conclude the paper in Section V.

II. RELATED WORK

Due to the detrimental nature of wireless communication failures, there are many studies dedicated to the detection [18], [20], [25], the impact estimation [11], [23], [24], the mitigation [13] and the repair [1] of communication links [14], with the aim of restoring the system to a normal state of functioning. In wireless sensor networks (WSNs), devices are usually capable of providing two pieces of information about the channel quality: the Link Quality Indication (LQI) and the Received Signal Strength Indication (RSSI). They constitute a form of Channel State Information (CSI) [4], [14]. While each device is able to estimate the quality of its own links, obtaining a global view of the average packet loss across the system in a fully distributed manner has not been extensively studied.

The traditional approach to failure detection is to have each node send heartbeat broadcast messages at regular time intervals. Other nodes (such as link-neighbors for the case