Future Generation Computer Systems 23 (2007) 913–919
www.elsevier.com/locate/fgcs

FAIL-FCI: Versatile fault injection

William Hoarau, Sébastien Tixeuil ∗, Fabien Vauchelles

LRI-CNRS 8623 & INRIA Grand Large, France

Received 14 April 2006; received in revised form 22 January 2007; accepted 29 January 2007
Available online 11 February 2007

Abstract

One of the topics of paramount importance in the development of Grid middleware is the impact of faults, since their probability of occurrence in a Grid infrastructure and in large-scale distributed systems is actually very high. In this paper, we explore the versatility of a new tool for fault injection in distributed applications: FAIL-FCI. In particular, we show that not only are we able to fault-load existing distributed applications (as used in most current papers that address fault-tolerance issues), we are also able to inject qualitative faults, i.e. inject specific faults at very specific moments in the program code of the application under test. Finally, and although this was not the primary purpose of the tool, we are also able to inject specific patterns of workload, in order to stress test the application under test. Interestingly enough, the whole process is driven by a simple unified description language that is totally independent from the language of the application, so that no code changes or recompilation are needed on the application side.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Fault-tolerance; Fault-injection; Stress testing; Grid middleware

1. Introduction

It is expected that Grid middleware is reliable and provides comprehensive support for fault-tolerance mechanisms, such as failure-detection, check-pointing recovery, replication, software rejuvenation, and component-based reconfiguration, among others.
One of the techniques for evaluating the effectiveness of those fault-tolerance mechanisms and the reliability level of Grid middleware is to use fault-injection tools and a robustness tester to assess experimentally the dependability metrics of the target system. In this paper, we present software that can be used both for software fault-injection and for stress testing of distributed applications, which are the basis for dependability benchmarking in Grid Computing.

Some applications (for example peer-to-peer applications) involve a considerable number of users, e.g. to exchange files or to execute long calculations (SETI@home, Decrypthon, XtremWeb, BOINC, etc.). For those applications, the appearance and disappearance of participating machines are unpredictable, very frequent, and occur while the application is running.

It is particularly difficult to study the functioning of large-scale distributed programs: a considerable number of computers and considerable engineering effort would be needed to execute the software in an actual situation, to measure its performance, or to detect its defects. Testing the validity of fault-tolerant software and measuring the impact of faults on performance requires being able to control those faults.

When an application is run on a cluster, it is likely that machines will run roughly at the same speed (for example, a one-to-ten ratio on the relative speeds of the processors makes it easy to solve the consensus problem), so the system considered is actually synchronous. When the application is later run on a larger scale (e.g. in an Internet-like setting), where the strong synchrony hypothesis no longer holds, it may turn out that crucial issues related to fault-tolerance and asynchronous settings have been overlooked.

∗ Corresponding address: Université Paris Sud, LRI & INRIA, Bâtiment 490, 91405 Orsay cedex, France. Tel.: +33 1 69 15 42 39; fax: +33 1 69 15 65 86.
E-mail addresses: hoarau@lri.fr (W. Hoarau), tixeuil@lri.fr (S. Tixeuil).

0167-739X/$ - see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.future.2007.01.005

2. Distributed fault-injection

2.1. State of the art

The issues in testing component-based distributed systems have already been described, and the methodology for testing