IEEE TRANSACTIONS ON COMPUTERS, VOL. C-23, NO. 7, JULY 1974

Fault-Tolerance of the Iterative Cell Array Switch for Hybrid Redundancy

ROY C. OGUS, STUDENT MEMBER, IEEE

Abstract-The technique of hybrid redundancy has been used to protect those portions of a digital system which have to be made ultrareliable. Siewiorek and McCluskey have presented a new switch design for hybrid redundancy which is shown to be of less complexity than other switch designs presented in the literature. The possibility of increasing the overall system reliability is examined, considering schemes which will allow the switch to be tolerant of a certain number of faults in the hardware. The use of fail-safe logic together with a coding scheme has been found to be an effective way to increase the fault-tolerance of the switch. The fault-tolerance could also be achieved through the use of classical protection schemes such as triple modular redundancy (TMR). The two fault-tolerant strategies are compared using reliability, cost, and speed parameters as basis for comparison. A reliability analysis of the various system configurations has indicated that it is advantageous to increase the fault-tolerance, as a significant improvement of the overall system reliability is obtained. However, an optimum fault-tolerance exists, beyond which degradation of the system results. The question of whether adding spares to a system will always improve the system reliability is studied and it is found that this is not always the case. Finally, the implementation of error detection circuitry using self-checking checkers is discussed, together with the possibility of the construction of a fault-tolerant voter.

Index Terms-Computer reliability, error-correcting codes, fail-safe logic, fault-tolerance, hybrid redundancy, iterative cell array, mission time, self-checking checker, threshold voter, triple modular redundancy (TMR).
Manuscript received August 6, 1973; revised February 15, 1974. This work was supported by the National Science Foundation under Grant GJ-27527, with partial support from the South African Council for Scientific and Industrial Research. The author is with the Digital Systems Laboratory, Stanford University, Stanford, Calif. 94305.

INTRODUCTION

The use of N-modular redundancy (NMR) together with standby sparing has resulted in a very promising technique for protecting those portions of a digital system whose continuous real-time operation is essential. This technique is known as hybrid redundancy [1], [2] and consists of using N identical copies of the original module (where N is odd), forming an NMR core, connected to a voting element which outputs a value corresponding to the majority of the input values. Also provided are a number of standby spare modules which can replace any of the modules in the core in the event of one of the latter failing. The hybrid redundancy scheme is shown in Fig. 1. Whenever a module malfunctions it is replaced by an identical spare unit, the switch being the element which carries out the replacement.

In Fig. 1 the NMR core is composed of N (N being odd) modules and there are S standby spares. The modules are connected to the switch and a disagreement detector monitors the outputs of the modules and compares them to the voter output. If the disagreement detector discovers a discrepancy then the switch will remove the disagreeing module from the NMR core and replace it by a spare.

It is shown in [1] that this scheme can be made ultrareliable if the voter, switch, and disagreement detector (VSD) are very reliable. Thus the implementation of a simple switching scheme which can be made very reliable at a reasonable cost is essential. A new switch design has been presented by Siewiorek and McCluskey [3] that is shown to be of considerably less complexity than other switch designs presented elsewhere in the literature.
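The behavior of the voter, disagreement detector, and switch described above can be illustrated by a minimal behavioral sketch. It is not the iterative cell array design of [3]; it only models the replacement policy at the level of module outputs. The names `majority_vote`, `disagreeing`, and `HybridSystem` are hypothetical, modules are modeled as plain Python callables, and two further assumptions are made: spares are taken in order, and a disagreeing module is simply left in the core once the spares are exhausted.

```python
from collections import Counter


def majority_vote(outputs):
    """Voter: return the value produced by a strict majority of the core."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise ValueError("no majority among module outputs")
    return value


def disagreeing(outputs, voted):
    """Disagreement detector: core positions whose output differs from the voter."""
    return [i for i, out in enumerate(outputs) if out != voted]


class HybridSystem:
    """N-module core plus standby spares (hypothetical behavioral model)."""

    def __init__(self, modules, n_core):
        if n_core % 2 == 0:
            raise ValueError("N must be odd")
        if n_core > len(modules):
            raise ValueError("not enough modules to fill the core")
        self.core = list(modules[:n_core])     # the NMR core
        self.spares = list(modules[n_core:])   # the S standby spares

    def step(self, x):
        """Apply one input, vote, and switch out any disagreeing module."""
        outputs = [module(x) for module in self.core]
        voted = majority_vote(outputs)
        for i in disagreeing(outputs, voted):
            if self.spares:
                # Switch: replace the disagreeing module by the next spare.
                self.core[i] = self.spares.pop(0)
            # Assumption: with no spares left, the module stays in the core.
        return voted
```

In this sketch the voter, detector, and switch are themselves assumed fault-free, which is exactly the assumption on the VSD that the remainder of the paper relaxes.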
The detailed description and design of the iterative cell switch are given in [3] but the scheme will be briefly described here for completeness. Thereafter, the possibility of increasing the reliability of the VSD will be examined, considering schemes which will allow the VSD to be able to function correctly even in the presence of failures in its hardware.

We shall make a distinction between the terms reliability and fault-tolerance. The reliability of a network (implicitly assumed to be measured at a time T) is the probability that the network has not failed up to time T. The fault-tolerance of a network is the number of faults which can simultaneously occur in the network without causing it to function incorrectly. Thus, the term fault-tolerance is a measure of the ability of a network to perform correctly even if faults exist in its hardware. A network which has a fault-tolerance of zero will be called a simplex or irredundant network.

In this paper we shall consider two schemes to increase the fault-tolerance of the simplex VSD proposed in [3] and determine the effect on the reliability of the overall system. It will be shown that the concept of fail-safe logic (together with a coding scheme) can be used to increase the fault-tolerance of the VSD and will improve the system reliability significantly. A second scheme uses the classical technique of NMR and comparisons will be made between the two fault-tolerant VSD's and the simplex VSD based on reliability, cost, and speed parameters. Finally, the implementation of fault-detection circuitry will be considered so that faults in the VSD may be detected and repaired before the fault-tolerance of the VSD is exceeded. Self-checking techniques can be ef-