U.P.B. Sci. Bull., Series C, Vol. 73, Iss. 1, 2011    ISSN 1454-234x

A SIMULATION MODEL FOR FAULT TOLERANCE EVALUATION

Adrian BOTEANU 1, Ciprian DOBRE 2

In this paper we present a simulation model designed to evaluate fault tolerance solutions in large-scale distributed systems. The model extends the MONARC simulation model with new capabilities for fault tolerance simulation: it describes the faults that can occur in such systems and includes mechanisms to detect and react to them. We also present a pilot implementation of the model in MONARC, together with the results of the evaluation tests. The implementation covers both permanent and transient failures occurring in processing units, network components, and databases. The model is easily extensible, allowing new failure models and related technologies to be added as required by user experiments. It can be used together with key performance metrics to evaluate fault tolerance solutions for distributed systems and to quickly pinpoint the likely points or areas of failure in the simulated environment.
Keywords: fault tolerance, distributed systems, performance analysis, simulation model, faults

1 Eng., Faculty of Automatics and Computer Science, University POLITEHNICA of Bucharest, Romania, e-mail: adiboteanu@gmail.com
2 Lect., Faculty of Automatics and Computer Science, University POLITEHNICA of Bucharest, Romania, e-mail: ciprian.dobre@cs.pub.ro

1. Introduction

Modeling and simulation have long been seen as viable means to develop new algorithms and technologies and to enhance large-scale distributed systems, where analytical validation is prohibited by the scale