Extending GridSim with an Architecture for Failure Detection

Agustín Caminero¹, Anthony Sulistio², Blanca Caminero¹, Carmen Carrión¹, and Rajkumar Buyya²

¹ Department of Computing Systems, The University of Castilla La Mancha, Spain
{agustin, blanca, carmen}@dsi.uclm.es

² Dept. of Computer Sc. & Software Eng., The University of Melbourne, Australia
{anthony, raj}@csse.unimelb.edu.au

Abstract

Grid technologies are emerging as the next generation of distributed computing, allowing the aggregation of resources that are geographically distributed across different locations. However, these resources are independent and managed separately by various organizations with different policies. This has a major impact on users who submit their jobs to the Grid, as they have to deal with issues such as policy heterogeneity, security and fault tolerance. Moreover, changes in Grid conditions, such as resources that become unavailable for a period of time due to maintenance and/or suffer failures, can significantly affect the Quality of Service (QoS) requirements of users. Therefore, it is essential for users to take into account the effects of resource failures during job execution.

In this paper, we present our work on introducing resource failures and failure detection into the GridSim simulation toolkit. As we need to conduct repeatable and controlled experiments, it is easier to use simulation as a means of studying complex scenarios. We also give a detailed description of the overall design and a use case scenario demonstrating how the conditions of resources vary over time.

1 Introduction

Grid systems are the next generation of distributed computing systems. They are highly heterogeneous environments and consist of a series of independent organizations sharing their resources, creating what is known as a Virtual Organization (VO) [5].
In Grid systems, each organization keeps its own independence and autonomy, since this is essential for the creation of a real Grid. Therefore, each organization can decide its own policy and whether to join or leave a VO at any time. Such decisions will certainly affect users in submitting and executing their jobs.

Another important issue concerning users is that the number of resources can fluctuate significantly over time. The availability of resources may vary due to changes in their working conditions, such as network congestion, partial failures or even the connection or disconnection of computing resources. With many resources in a Grid, resource or network failures are the rule rather than the exception. Hence, they should be taken into account in order to provide a reliable service [11].

Supporting fault tolerance is one of the main technical challenges in designing Grid environments. This is because production Grid systems must be able to tolerate resource failures, while at the same time effectively exploiting the resources in a scalable and transparent manner. Therefore, in order to cope with these challenges from the fault tolerance point of view, the system must have failure detection and recovery schemes. Such a scenario is depicted in Figure 1. A failure detector monitors the components of the system (step 1), and notifies a local Grid Information Service (GIS) entity when a failure occurs in any of them (step 2). Then, a recovery scheme is applied in order to restart a failed job on another computational resource dynamically. An example of such a scheme is that the failure monitor notifies users that resource Res2 is out of order (step 3). Hence, users can resubmit their jobs to other resources. As demonstrated in this example, both detection and recovery schemes must be an integral part of the Grid computing infrastructure [13, 18].
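
The three steps of the scenario above can be illustrated with a minimal Python sketch. It is not GridSim code and does not use the GridSim API: the class names (Resource, User, GridInformationService, FailureDetector), the polling approach, and the choice to have the GIS relay the failure notice to users are all assumptions made for illustration only. The recovery policy (resubmit to the first resource not known to have failed) is likewise a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical computational resource; `alive` models its current state."""
    name: str
    alive: bool = True


@dataclass
class User:
    """Hypothetical user; `jobs` maps a job name to the resource running it."""
    name: str
    jobs: dict = field(default_factory=dict)

    def on_resource_failed(self, failed, available):
        # Step 3: resubmit every job that was running on the failed resource.
        for job, res in list(self.jobs.items()):
            if res == failed and available:
                self.jobs[job] = available[0]  # placeholder recovery policy


@dataclass
class GridInformationService:
    """Hypothetical local GIS entity that tracks resources and users."""
    resources: dict = field(default_factory=dict)  # resource name -> Resource
    users: list = field(default_factory=list)
    failed: set = field(default_factory=set)

    def register(self, resource):
        self.resources[resource.name] = resource

    def report_failure(self, name):
        # Step 2: the detector notifies the GIS that a resource has failed.
        if name in self.failed:
            return  # already known, do not notify users twice
        self.failed.add(name)
        available = [n for n in self.resources if n not in self.failed]
        for user in self.users:
            user.on_resource_failed(name, available)


class FailureDetector:
    """Hypothetical detector that polls every resource registered at a GIS."""

    def __init__(self, gis):
        self.gis = gis

    def poll(self):
        # Step 1: monitor the components of the system.
        for name, resource in self.gis.resources.items():
            if not resource.alive:
                self.gis.report_failure(name)


# Usage: Res2 goes out of order, so the job running on it is resubmitted.
gis = GridInformationService()
for res_name in ("Res1", "Res2", "Res3"):
    gis.register(Resource(res_name))
user = User("user0", {"job0": "Res2", "job1": "Res3"})
gis.users.append(user)

gis.resources["Res2"].alive = False
FailureDetector(gis).poll()
print(user.jobs)
```

In this sketch the job on Res2 is moved to another resource, while the job on Res3 is left untouched. A real detector would also have to cope with network failures and delayed or lost notifications, which the sketch ignores.
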
Therefore, it is important to conduct a thorough study and evaluation of new reliability models and of error detection and recovery techniques before they can be deployed in production Grid environments.

To test new detection and recovery schemes in a Grid environment like the above scenario, a lot of work is required to set up testbeds on many distributed sites. Even if automated tools exist to do this work, it would still be very difficult to produce performance evaluations in a repeatable and controlled manner, due to the inherent heterogeneity of the Grid. In addition, Grid testbeds are limited and creating