International Journal on Artificial Intelligence Tools
© World Scientific Publishing Company

Fault Diagnosis of Multiprocessor Systems Based on Genetic and Estimation of Distribution Algorithms: A Performance Evaluation

Elias P. Duarte Jr., Aurora T. R. Pozo, Bogdan T. Nassu
Department of Computer Science - Federal University of Parana
P.O. Box 19018, Curitiba, PR, Brazil
{elias, aurora, bogdan}@inf.ufpr.br

Received (12 July 2008)
Revised (20 July 2009)
Accepted (Day Month Year)

As faults are unavoidable in large-scale multiprocessor systems, it is important to be able to determine which units of the system are working and which are faulty. System-level diagnosis is a long-standing, realistic approach to detecting faults in multiprocessor systems. Diagnosis is based on the results of tests executed on the system units. In this work we evaluate the performance of evolutionary algorithms applied to the diagnosis problem. Experimental results are presented for both the traditional genetic algorithm (GA) and specialized versions of the GA. We then propose and evaluate specialized versions of Estimation of Distribution Algorithms (EDA) for system-level diagnosis: the compact GA and Population-Based Incremental Learning, both with and without negative examples. The evaluation was performed using four metrics: the average number of generations needed to find the solution, the average fitness after up to 500 generations, the percentage of tests that reached the optimal solution, and the average time until the solution was found. An analysis of the experimental results shows that more sophisticated algorithms converge faster to the optimal solution.

Keywords: System-Level Diagnosis; Evolutionary Algorithms; Multiprocessor Systems

1. Introduction

Large computer systems that rely on multiple processors to achieve their goals are increasingly popular. It is well known that, given a large enough time interval, processors will fail.
To be fault-tolerant [31], a large system consisting of several processors must employ efficient fault detection strategies so that it can deliver its expected service even when some processors become faulty. System-level diagnosis consists of determining which units of a system are faulty and which are fault-free [32,30]. Based on this information, reconfiguration actions can be executed to keep the system available. The classical model for system-level diagnosis is the PMC model [26], named after its authors' initials: Preparata, Metze and Chien. In this model, a system is defined as a set of heterogeneous units. Each unit in the system can be in one of two states: faulty or fault-free. Given a fault situation, the subset of permanently faulty units is called the fault set.
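
The two-state view of units and the notion of a fault set can be illustrated with a minimal Python sketch. The names `State` and `fault_set`, the five-unit example system, and the use of list indices as unit identifiers are assumptions made for illustration only; they are not taken from the PMC model's original formulation or from the algorithms evaluated in this paper.

```python
from enum import Enum


class State(Enum):
    """The two states a unit can be in (hypothetical encoding)."""
    FAULT_FREE = 0
    FAULTY = 1


def fault_set(states):
    """Return the set of unit indices whose state is FAULTY.

    `states` is a sequence in which position i holds the state of unit i
    (an illustrative convention, not one prescribed by the model).
    """
    return {unit for unit, state in enumerate(states) if state is State.FAULTY}


# Example fault situation for a hypothetical system of five units,
# in which units 1 and 3 are permanently faulty.
states = [
    State.FAULT_FREE,
    State.FAULTY,
    State.FAULT_FREE,
    State.FAULTY,
    State.FAULT_FREE,
]

print(sorted(fault_set(states)))  # prints [1, 3]
```

In this representation, diagnosis amounts to recovering the value returned by `fault_set` without direct access to `states`, using only the outcomes of tests among the units.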