A Fault Tolerance Approach Based on Reinforcement Learning in the Context of Autonomic Opportunistic Grids

Alcilene Dalília de Sousa and Luciano Reis Coutinho
Departamento de Informática, Universidade Federal do Maranhão (UFMA)
São Luis, Maranhão - Brasil
e-mail: alcileneluzsousa@gmail.com / lrc@deinf.ufma.br

Abstract

Fault tolerance is a longstanding problem. Two basic solutions are replication and checkpointing, both with their pros and cons. In this paper, we put forward an approach to balancing replication and checkpointing in order to provide fault tolerance in opportunistic grid computing systems. We try to retain the benefits of both techniques, while avoiding their downsides. The approach combines reinforcement learning with the MAPE-K architecture for autonomic computing. To validate our proposal, we have performed simulation-based experiments using the Autonomic Grid Simulator Tool (AGST). We report promising results. We show that the proposed approach is able to learn suitable switching thresholds between checkpointing and replication. The suitability is verified by comparing the average completion time and success rate of applications under our proposal against the values obtained by other approaches in the literature.

Keywords - fault tolerance; grid computing; opportunistic grids; autonomic computing; reinforcement learning.

I. INTRODUCTION

Achieving a high processing rate by dividing computational tasks among several geographically distributed machines is, in essence, the core idea behind the computational model called Grid Computing [9]. In this context, the concept of opportunistic grids was introduced: grids that, potentially, gather thousands of resources, services and applications to provide greater computational power at a lower cost [12]. On the one hand, this type of computational grid promotes the use of non-dedicated resources, such as desktop workstations situated in different administrative domains, by using their idle processing power.
On the other hand, there is the challenge of providing services in a dynamic execution environment, where heterogeneous nodes can enter and leave the grid at any time. In this case, it is important to be able to effectively monitor the grid composition in order to detect and react to these events in a timely manner.

Grid computing encompasses various technical challenges. One of them, especially in the context of opportunistic grids, is how to provide fault tolerance in an inherently dynamic environment, an environment in which computing nodes are heterogeneous and can become unavailable at any time. Fault tolerance is the ability of a system to continue to work even in the presence of faults [7]. We say that a fault has occurred when one of the system's components fails or malfunctions, leading to behavior not in accordance with the system specifications.

Concerned with this problem, researchers have been seeking solutions. Among these, we find some approaches based on the idea of autonomic computing. Broadly, the idea of autonomic computing consists of modeling and building computing systems that have the ability of self-management and self-adaptation to unpredictable changes [6][8][10]. Applied to the problem of fault tolerance in opportunistic grids, autonomic computing has given rise to approaches where the grid middleware tries to automatically adjust parameters or to dynamically combine traditional fault tolerance techniques such as checkpointing and replication [2][15][16][18]. In general, these autonomic approaches provide important gains with respect to the traditional fault tolerance techniques.

Despite these gains, there are some opportunities for improvement. In this paper, we focus on the following issue regarding the approach presented in [15][16] (and based on [2]). When should we switch from checkpointing to replication, and vice versa, given the current workload of an opportunistic grid system?
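To make this decision concrete, the minimal sketch below illustrates a workload-threshold rule that selects between the two mechanisms. It is an illustration under our own assumptions: the function name and the normalized workload scale are hypothetical, not part of any grid middleware discussed in the paper.

```python
# Illustrative sketch only: a fixed workload threshold deciding which
# fault tolerance mechanism to apply. The normalized workload scale
# (0.0 = idle grid, 1.0 = fully loaded) is our own assumption.
def choose_mechanism(workload: float, threshold: float = 0.5) -> str:
    """Replicate tasks while the grid is lightly loaded; switch to
    checkpointing once the workload exceeds the threshold."""
    return "replication" if workload < threshold else "checkpointing"
```

With a fixed threshold of 0.5, `choose_mechanism(0.2)` yields `"replication"` and `choose_mechanism(0.8)` yields `"checkpointing"`; the question the paper raises is whether such a threshold should be fixed at all.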
Viana [15] fixes a threshold below which the grid performs replication and above which it performs checkpointing. Our hypothesis is that this threshold can vary, and that this variation is beneficial to the fault tolerance strategy of the grid system. It can lead to fewer delays in the execution time of the application due to fault tolerance concerns.

To test this hypothesis, we put forward an adaptive approach to the problem of balancing checkpointing and replication in the context of opportunistic grids. Our approach is based on the use of reinforcement learning. Reinforcement learning is a paradigm of machine learning based on trial-and-error and delayed reward [1][13]. A typical reinforcement learning problem consists in finding a policy (a decision function) that maps environmental states to actions. Such a policy is optimal when it can be used to guide our behavior through the environment in such a way that we obtain maximum cumulative reward over time. In the case of the problem of balancing checkpointing and replication, we want an optimal (or near-optimal) policy that helps us decide, directly or indirectly, when it is the right time to switch from checkpointing to replication, and vice versa. By choosing reinforcement learning we are following the path of several researchers in the areas of Grid, Cloud and Autonomic Computing [3][4][14][17][19].

Copyright (c) IARIA, 2014. ISBN: 978-1-61208-331-5
ICAS 2014: The Tenth International Conference on Autonomic and Autonomous Systems
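As an illustration of how such a policy might be learned, the sketch below frames threshold adjustment as a small tabular Q-learning problem. This is a hypothetical toy, not the implementation evaluated in this paper: the state discretization, the three-action set, and the reward signal (e.g., negative application completion time) are all our own assumptions.

```python
import random

# Illustrative sketch (not the authors' implementation): a tabular
# Q-learner that adapts the workload threshold at which the grid
# switches from replication to checkpointing.
ACTIONS = [-1, 0, +1]  # lower / keep / raise the threshold by one level

class ThresholdLearner:
    def __init__(self, n_levels=10, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.n = n_levels                  # discrete threshold levels
        self.alpha, self.gamma, self.eps = alpha, gamma, epsilon
        # q[level][action]: expected return of that threshold move
        self.q = [[0.0] * len(ACTIONS) for _ in range(n_levels)]
        self.level = n_levels // 2         # start with a mid-range threshold

    def choose(self):
        """Epsilon-greedy selection over threshold moves."""
        if random.random() < self.eps:
            return random.randrange(len(ACTIONS))
        row = self.q[self.level]
        return row.index(max(row))

    def update(self, action_idx, reward):
        """One Q-learning step. The reward is assumed to come from the
        environment, e.g., the negative completion time observed after
        running applications under the adjusted threshold."""
        old = self.level
        self.level = min(self.n - 1, max(0, self.level + ACTIONS[action_idx]))
        best_next = max(self.q[self.level])
        self.q[old][action_idx] += self.alpha * (
            reward + self.gamma * best_next - self.q[old][action_idx])

    def mechanism(self, workload):
        """Below the learned threshold use replication; above it,
        checkpointing (the switching scheme discussed in the text)."""
        return "replication" if workload < self.level / self.n else "checkpointing"
```

In a real deployment the `update` step would be driven by measurements collected by the grid middleware's monitoring component; here the learner is deliberately self-contained to keep the policy/state/action/reward structure visible.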