Reliability Estimation of Fault-Tolerant Systems: Tools and Techniques

Robert Geist, Clemson University
Kishor Trivedi, Duke University

Comparatively evaluating state-of-the-art tools and techniques helps us estimate the reliability of fault-tolerant computing systems. We consider design limitations, efficiency, and accuracy.

0018-9162/90/0700-0052$01.00 © 1990 IEEE, COMPUTER

A [...] power has focused attention on tools and techniques we might use to accurately estimate the reliability of a proposed computing system on the basis of models derived from the design of that system. Reliability modeling of fault-tolerant computing systems has become an integral part of the system design process, especially for systems with life-critical applications such as aircraft and spacecraft flight control.

Reliability modeling has also become an important arena in which to view the classic struggle between model accuracy, that is, the extent to which a model of a system faithfully represents the system under study, and model tractability, that is, the extent to which the modeler can extract useful information from the model in a cost-effective manner.

Within this arena, additional complexity constraints, which typically render the classical modeling tools inadequate, compound the difficulty of finding solutions to this trade-off problem. One constraint is the huge disparity in state transition rates. A rate ratio (largest rate/smallest rate) of $10^{10}$ within a single model is not uncommon, yielding "stiff" systems of differential, integral, or algebraic equations, for which standard numerical techniques are largely inadequate. Great progress has been made in recent years on numerical techniques for solving stiff systems [1], but a ratio of $10^{10}$ coupled with a system of size $10^{5}$ still appears to be out of reach.

Our purpose here is to comparatively evaluate state-of-the-art tools and techniques for estimating the reliability of fault-tolerant computing systems. Our goal is to consider these tools and techniques from both ends of the struggle described. In particular, we will look closely at design limitations imposed by underlying model assumptions, on the one hand, and at the efficiency and accuracy of the solution techniques employed, on the other hand.

Background

Recall that if $X$ is a random variable that denotes the lifetime or time-to-failure of a computing component or system and $X$ has distribution function

$$F_X(t) = P(X \le t) \tag{1}$$

then the reliability of the component or system, $R_X(t)$, is the probability that the system survives until time $t$, that is,

$$R_X(t) = P(X > t) = 1 - F_X(t) \tag{2}$$

If $R_X(t)$ is differentiable, then the hazard rate or failure rate of the component or system is given by

$$h_X(t) = -\frac{1}{R_X(t)} \frac{dR_X(t)}{dt} \tag{3}$$
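As a small numerical illustration of the definitions above, the following Python sketch evaluates the reliability and the hazard rate of a component from its lifetime distribution function. It is a minimal sketch under stated assumptions, not a method taken from the article: the Weibull lifetime distribution, the function names (`weibull_cdf`, `reliability`, `hazard_rate`), the central-difference derivative, and the parameter values are all hypothetical choices made for illustration.

```python
import math


def weibull_cdf(t, shape, scale):
    """Distribution function F_X(t) = P(X <= t) of a Weibull lifetime.

    The Weibull form is an illustrative assumption; shape = 1 reduces it
    to the exponential distribution with rate 1 / scale.
    """
    return 1.0 - math.exp(-((t / scale) ** shape))


def reliability(cdf, t):
    """Reliability R_X(t) = P(X > t) = 1 - F_X(t)."""
    return 1.0 - cdf(t)


def hazard_rate(cdf, t, dt=1e-3):
    """Hazard rate h_X(t) = -R_X'(t) / R_X(t).

    The derivative is approximated by a central difference with step dt,
    which is a hypothetical choice suited to the time scales used below.
    """
    r_prime = (reliability(cdf, t + dt) - reliability(cdf, t - dt)) / (2.0 * dt)
    return -r_prime / reliability(cdf, t)


if __name__ == "__main__":
    failure_rate = 1e-3  # hypothetical failure rate, per hour

    # Exponential lifetime: the hazard rate should be constant.
    def exponential_cdf(t):
        return weibull_cdf(t, 1.0, 1.0 / failure_rate)

    # Weibull lifetime with shape 2: the hazard rate grows linearly in t.
    def wearout_cdf(t):
        return weibull_cdf(t, 2.0, 1000.0)

    for t in (10.0, 100.0, 1000.0):
        print(
            t,
            reliability(exponential_cdf, t),
            hazard_rate(exponential_cdf, t),
            hazard_rate(wearout_cdf, t),
        )
```

For the exponential case the computed hazard rate stays at the assumed failure rate for every t, which is the familiar constant-failure-rate property. For the shape-2 Weibull case it increases with t, following the closed form 2t/scale^2.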