Multi-Agent Systems and Fault-Tolerance: State of the Art Elements

Jean-Pierre Briot 1, Samir Aknine 1, Isabelle Alvarez 1, Zahia Guessoum 1,2, Jacques Malenfant 1, Olivier Marin, Jean-François Perrot 1, Pierre Sens 1

1 LIP6, Université Paris 6 - CNRS, 104 avenue du Président Kennedy, 75016 Paris, France
2 MODECO-CReSTIC, IUT de Reims, 51687 Reims Cedex 2, France
{Jean-Pierre.Briot, Samir.Aknine, Isabelle.Alvarez, Zahia.Guessoum, Jacques.Malenfant, Olivier.Marin, Jean-Francois.Perrot, Pierre.Sens}@lip6.fr

1 Multi-Agent Systems and Fault-Tolerance

Multi-agent systems (MAS) represent an effective approach for designing and constructing collaborative applications as sets of autonomous entities, named agents, which interact and coordinate. References on multi-agent systems include [1], [2], and [3]. Although multi-agent systems provide an effective approach for decentralized applications, they still need to better address fault-tolerance [4].

1.1 Fault-Tolerance

Fault-tolerance is one of the means of increasing the dependability of applications (the others are fault prevention, fault removal, and fault forecasting). Fault-tolerance means avoiding service failures in the presence of faults [5] (see also, e.g., [6] for an interesting alternative definition of fault-tolerance). The fault-tolerance research community has developed solutions (algorithms and architectures), some more curative, e.g., based on exception handling and cooperative recovery [7] (see also the companion bibliography [26]), and some more preventive, notably based on the concept of replication, applied, e.g., to databases.

1.2 Cloning

Within the family of reactive multi-agent systems, some applications offer high redundancy. A good example is a system based on the metaphor of ant nests.
Unfortunately, not all applications can be designed as reactive multi-agent systems. Moreover, such a simple redundancy scheme cannot be applied to more cognitive multi-agent systems, as it would cause inconsistencies between copies of a single agent. Work by [8] offers dynamic cloning of specific agents in multi-agent systems. Their motivation is different, however: the objective is to improve the availability of an agent when it is too congested. They do not consider the recovery of task state upon failure. The agents considered implement only stateless sessions, i.e., functional tasks without state, for which fault-tolerance can be ensured by simply redoing the aborted tasks.

1.3 Replication

As discussed by [9], software replication in distributed environments has some advantages over other fault-tolerance solutions. First and foremost, it provides the groundwork for the shortest recovery delays. Also, it is generally less intrusive with respect to execution time. Finally, it scales much better. Another important advantage, from the design perspective, is that the use of software replication is relatively