Brief Contributions

Correct and Almost Complete Diagnosis of Processor Grids

Stefano Chessa and Piero Maestrini, Senior Member, IEEE

Abstract: A new diagnosis algorithm for square grids is introduced. The algorithm always provides a correct diagnosis if the number of faulty processors is below $T$, a bound with $T \in \Theta(n^{2/3})$, which was derived by worst-case analysis. A more effective tool to validate the diagnosis correctness is the syndrome-dependent bound $T_\sigma$, with $T_\sigma \ge T$, asserted by the diagnosis algorithm itself for every given diagnosis experiment. Simulation studies provided evidence that the diagnosis is complete or almost complete if the number of faults is below $T$. The fraction of units which cannot be identified as either faulty or nonfaulty remains relatively small as long as the number of faults is below $n/3$ and, as long as the number of faults is below $n/2$, the diagnosis is correct with high probability.

Index Terms: System-level diagnosis, PMC model, processor grids, constant-degree diagnosis, diagnosis algorithm.

1 INTRODUCTION

System-level diagnosis was introduced by Preparata et al. [15]. Their diagnostic model, called PMC, aims at diagnosing systems composed of a set of units (usually processors) connected by point-to-point links. Every unit may be in one of two states: nonfaulty or faulty. Faults are assumed to be permanent. The system is represented by an undirected graph $SG = (N, G)$, called the system graph, where nodes correspond to units and edges correspond to links. The number $\#N$ of units (see footnote 1) is denoted $n$.

Diagnosis is based on a suitable set of tests between pairs of interconnected units. Every test involves a testing unit and a tested unit. The testing unit $u$ provides a test sequence to the tested unit $v$, which returns an output sequence to $u$.
In turn, unit $u$ compares the actual and the expected output sequences and provides a binary test outcome, defined as 0 if the actual and the expected results match and 1 otherwise. The testing structure of the system is represented by a directed diagnostic graph $DG = (N, G_D)$, where the directed edges in set $G_D$ represent the tests. There exists an edge from $u_i$ to $u_j$ if and only if unit $u_i$ tests unit $u_j$.

The PMC model assumes that fault-free units are always able to identify faulty units, while the outcomes of tests performed by faulty units are completely unreliable. This invalidation rule is summarized in Table 1. Alternate diagnostic models (for example, [3] and [13]) assume different strategies to perform the tests and/or different invalidation rules.

Given a set $N_f \subseteq N$ of faulty units (the actual fault set), the set of all the test outcomes is called the syndrome. The syndrome is collected by an external, reliable diagnoser and decoded by a diagnosis algorithm. In general, the algorithm provides a diagnosis of the system by partitioning set $N$ into a subset $F$ of units declared faulty, a subset $K$ of units declared nonfaulty, and a subset $S$ of suspect units. The diagnosis is said to be correct if $F \subseteq N_f$ and $K \subseteq N - N_f$. It is said to be complete if $S = \emptyset$.

The classical approach to system diagnosis [15] relies on a system parameter called diagnosability. The diagnosability of a diagnostic graph $DG$ is the largest integer $t$ such that correct and complete diagnosis (also called one-step diagnosis) is possible without exception for all fault sets with $\#N_f \le t$. A system with this property is said to be $t$-diagnosable. The value of $t$ is bounded above by the minimum of the node indegrees in the diagnostic graph [15]. A characterization of $t$-diagnosable systems is given in [9]. A general one-step diagnosis algorithm is reported in [8].
However, this approach is not satisfactory in the case of large systems based on regular or quasi-regular interconnection structures, such as grids, tori, and hypercubes, where the one-step diagnosability is very small compared to the number of units and, presumably, to the potential number of faults. Such interconnection structures are typical of massively parallel systems and wafer-scale VLSI testing [19], [6] which, on the other hand, appear to be the natural candidates for the application of system-level diagnosis.

An alternate approach was introduced by Scheinerman [20], who showed that correct and complete diagnosis of systems with $n$ units can be obtained with probability approaching 1 as $n \to \infty$ if the average number of links per unit is slightly above $\log n$. Blough et al. [4] improved this result by providing an algorithm which achieves a probabilistically correct diagnosis with $O(n \cdot \log n)$ test links. Rangarajan and Fussell [16], [17], [18] presented a distributed, comparison-based probabilistic model to assure correct diagnosis with high probability.

In [11], LaForge et al. presented a probabilistic algorithm for square grids augmented with additional interconnections in order to increase the number of tests. The set $F$ of units declared faulty by their algorithm satisfies $F \supseteq N_f$; however, it is shown that the fraction of nonfaulty units in $F$ can be kept arbitrarily small with high probability. A related algorithm to diagnose square grids connected as tori was published subsequently [10]. Sets $F$ and $K$ determined by the latter algorithm contain, with high probability, a constant fraction (very close to 1) of faulty and fault-free units, respectively. The diagnosis of two-dimensional square grids was also investigated in [21], where an incremental algorithm to achieve a probabilistically correct diagnosis was introduced. An extensive survey of the system-level diagnosis problem is provided in [2].
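As an illustration of the PMC invalidation rule and of the correctness and completeness conditions defined earlier in this Introduction, the following Python sketch generates a syndrome for a small square grid and checks a candidate diagnosis. It is not GDA and does not decode the syndrome. All function names are hypothetical, and three choices are assumptions made only for this example: the grid does not wrap around, every unit tests each of its neighbors, and the "completely unreliable" outcome of a faulty tester is modeled as a random bit.

```python
import random

def grid_tests(side):
    """Directed tests (u, v): assumes every unit tests each grid neighbor,
    on a grid without wraparound (both are assumptions of this sketch)."""
    tests = []
    for r in range(side):
        for c in range(side):
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < side and 0 <= cc < side:
                    tests.append(((r, c), (rr, cc)))
    return tests

def pmc_syndrome(tests, faulty, rng):
    """PMC rule: a nonfaulty tester reports 1 iff the tested unit is faulty;
    a faulty tester reports an arbitrary bit (modeled here as random)."""
    syndrome = {}
    for u, v in tests:
        if u in faulty:
            syndrome[(u, v)] = rng.randint(0, 1)
        else:
            syndrome[(u, v)] = 1 if v in faulty else 0
    return syndrome

def is_correct(F, K, nodes, faulty):
    """Correct diagnosis: F is a subset of N_f and K is a subset of N - N_f."""
    return F <= faulty and K <= (nodes - faulty)

def is_complete(S):
    """Complete diagnosis: the set S of suspect units is empty."""
    return len(S) == 0

def min_indegree(tests, nodes):
    """Minimum node indegree in the diagnostic graph, which bounds the
    one-step diagnosability t from above."""
    indeg = {v: 0 for v in nodes}
    for _, v in tests:
        indeg[v] += 1
    return min(indeg.values())

# Example: a 4 x 4 grid with two faulty units.
side = 4
nodes = {(r, c) for r in range(side) for c in range(side)}
tests = grid_tests(side)
faulty = {(1, 1), (2, 3)}
syndrome = pmc_syndrome(tests, faulty, random.Random(0))
```

Under these assumptions, `min_indegree(tests, nodes)` stays at most 4 whatever the value of `side`, which illustrates why the one-step diagnosability of a grid is very small compared to the number of units.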
This paper introduces a new diagnosis algorithm for square grids (GDA) which further develops an approach used in [1], dealing with diagnosis of octagonal grids, and in [14], which exploits a comparison model to diagnose grids. GDA correctly identifies a subset $F \subseteq N$ of faulty units, provided $\#N_f$ is less than $T$, a function with $T \in \Theta(n^{2/3})$ which has been derived by a worst-case analysis. For any given syndrome $\sigma$, GDA also asserts a syndrome-dependent bound $T_\sigma$, with $T_\sigma \ge T$, defined as the minimum number of faults which might lead GDA to an incorrect diagnosis under syndrome $\sigma$. Being far above $T$ in most cases, $T_\sigma$ is very effective in diagnosis validation. The average of the syndrome-dependent bound has been evaluated by extensive simulation.

Although correct, the diagnosis provided by GDA is generally incomplete. However, simulation studies showed that the diagnosis is complete or almost complete ($\#S \le 2$) if $\#N_f < T$ and that the number of units declared suspect is a relatively small fraction of $n$ as long as $\#N_f < n/3$.

IEEE Transactions on Computers, vol. 50, no. 10, October 2001, p. 1095.

The authors are with the Dipartimento di Informatica, Università di Pisa, corso Italia 40, 56125 Pisa, Italy, and the Istituto di Elaborazione del CNR, Area della Ricerca di Pisa, via Moruzzi 1, 56124 Pisa, Italy. E-mail: ste@di.unipi.it.

Manuscript received 21 May 1999; revised 21 Mar. 2001; accepted 8 May 2001. For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number 109930.

1. Throughout this paper #X denotes the cardinality of set X, for any X.

0018-9340/01/$10.00 © 2001 IEEE