A Generic Framework for Mobile Agent’s Fault Tolerance

Bassey E. Isong
Department of Computer Science and Info. Systems, University of Venda, Private Bag X5050, Thohoyandou 0950, South Africa.
bassey.isong@univen.ac.za

Obeten O. Ekabua
Department of Computer Science and Info. Systems, University of Venda, Private Bag X5050, Thohoyandou 0950, South Africa.
Obeten.ekabua@univen.ac.za

Abstract—Mobile agent execution is prone to failures originating from bad communication, security attacks, agent server crashes, unavailability of system resources, network congestion, or even deadlock situations. In such events, mobile agents either get lost or are damaged (partially or totally) during execution. Making mobile agents fault tolerant is a measure taken to increase the dependability and reliability of agent-based applications. Many approaches have been proposed, but the majority of existing mobile agent fault tolerance implementations are designed to tolerate only one or two of the fault classes (such as communication, crash, and agent software failure), not all of them in any situation. This perhaps makes it impossible to detect and recover from failures of all types. In this paper, based on an analysis of existing fault tolerance approaches, we propose a generic fault tolerance framework consisting of monitoring, planning, and recovery process execution phases that can help tolerate failures of all types. The framework is validated using existing implementation approaches.

Keywords: Mobile Agents, Reliability, Fault Tolerance, Framework, Checkpointing, Replication

I. INTRODUCTION

Mobile agents are of paramount interest in recent distributed computing trends in both academia and industry. Mobile agents are encapsulated pieces of software containing code and data that are able to migrate from one host to another and perform certain tasks autonomously [3]. They operate in a distributed computing environment consisting of heterogeneous devices and platforms.
Mobile agent technology tends to shift computation towards the data rather than the data towards the computation [1]. These distinct characteristics make mobile agents more flexible in deployment, which in turn makes the design, implementation, and maintenance of distributed systems much easier [2].

Mobile agent technology has been demonstrated in many application domains, such as network management, telecommunications, e-commerce, information retrieval, mobile/pervasive computing, artificial intelligence, workflow management, and Internet computing [5]. Research on mobile agent technology and its applications has gained considerable momentum in recent years, but reliability is still of great concern. Mobile agents, like any other software system, are not immune to failure, especially in the environments in which they operate. The exponential growth of distributed heterogeneous environments such as the Internet inherently exposes mobile agent execution to adverse conditions [3], [4]. Mobile agents may encounter traditional errors that emerge specifically from migration request failures, communication exceptions, or security violations [6]. To operate despite these failures, and for mobile agent technology to gain solid ground at the heart of today’s industrial applications, mobile agents have to be made sufficiently reliable through fault tolerance.

Fault tolerance aims to provide reliable execution of agents, or to resume service, in the face of system failure [2]. For fault tolerance in mobile agents to accomplish its goals, a reliable execution of mobile agents must adhere to two execution properties: non-blocking and exactly-once execution [7]. Today, several mobile agent fault tolerance approaches exist across a variety of mobile agent platforms. These approaches employ different mechanisms to provide reliable mobile agent execution, especially for failure detection and recovery.
Most of the recent approaches are optimizations of existing mechanisms; some are hybrid, while others are based on exception handling [8], [9], [10]. For instance, recent fault tolerance approaches rely mostly on replication, but each introduces an optimization to the replication process to either gain performance or lower the cost of replication.

In spite of the numerous approaches, fault tolerance in mobile agents still faces unaddressed challenges that have impeded its full realization. Most existing mobile agent fault tolerance implementations are designed to tolerate only one or two of the failure classes (such as communication, crash, and agent software failure), but not all of them in any situation. This makes it difficult, if not impossible, for mobile agents to detect and recover from failures of all types, and it calls for a generic fault tolerance model. In this paper, based on an analysis of existing fault tolerance approaches, we propose a generic fault tolerance framework consisting of monitoring, planning, and recovery process execution phases that can tolerate failures of all types. The framework is validated using existing implementation approaches.

978-1-4577-0174-0/11/$26.00 ©2011 IEEE

II. MOBILE AGENT’S FAULT TOLERANCE

The increasing demand for better system performance and dependability of software components is threatened by faults, which in turn deteriorate system reliability. Faults bring the normal execution state of a system into an error state, which in turn results in system failure [2]. Mobile agents are not exempt from operating in abnormal situations. They have a certain level of exposure to faults since they work in a distributed environment over the network and make their