IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 1, JANUARY 2005, p. 210

Coding Approaches to Fault Tolerance in Linear Dynamic Systems

Christoforos N. Hadjicostis, Member, IEEE, and George C. Verghese, Fellow, IEEE

Abstract: This paper discusses fault tolerance in discrete-time dynamic systems, such as finite-state controllers or computer simulations, with focus on the use of coding techniques to efficiently provide fault tolerance to linear finite-state machines (LFSMs). Traditional fault tolerance schemes rely heavily on the assumption that the error-correcting mechanism is fault free, particularly for dynamic systems operating over extended time horizons; in contrast, we are interested in the case when all components of the implementation are fault prone. The paper starts with a paradigmatic fault tolerance scheme that systematically adds redundancy into a discrete-time dynamic system in a way that achieves tolerance to transient faults in both the state transition and the error-correcting mechanisms. By combining this methodology with low-complexity error-correcting coding, we then obtain an efficient way of providing fault tolerance to multiple identical unreliable LFSMs that operate in parallel on distinct input sequences. The overall construction requires only a constant amount of redundant hardware per machine (provided the number of machines is sufficiently large) to achieve an arbitrarily small probability of overall failure for any prespecified (finite) time interval, leading in this way to a lower bound on the computational capacity of unreliable LFSMs.

Index Terms: Fault tolerance, linear dynamic systems, linear finite-state machines (LFSMs), transient faults, unreliable error correction.

I. INTRODUCTION AND TERMINOLOGY

A discrete-time dynamic system evolves in time according to an internal state that influences its output and future behavior.
Examples of dynamic systems include finite-state machines (FSMs), digital filters, convolutional encoders, decoders, and algorithms or simulations running on a computer architecture over several time steps. In this paper, we are interested in building reliable dynamic systems exclusively out of unreliable components, including components in any error-correcting mechanisms. We initially explore a general methodology for protecting an arbitrary discrete-time dynamic system against transient faults in its implementation; once this paradigm is analyzed, we combine it with coding techniques to build reliable linear finite-state machines (LFSMs) out of unreliable components (namely, unreliable XOR gates and voters). The end result is an efficient fault-tolerance scheme that uses a constant amount of redundant hardware per machine and is able to protect multiple unreliable LFSMs that have identical dynamics but operate on distinct input sequences.

Manuscript received December 13, 2001; revised September 14, 2004. This work was supported in part by the National Science Foundation under NSF CAREER Award 0092696, in part by the Air Force Office of Scientific Research under Award AFOSR DoD F49620-01-1-0365URI, and in part by fellowships from the National Semiconductor Corporation and the Grass Instrument Company. C. N. Hadjicostis is with the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801-2307 USA. G. C. Verghese is with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA. Communicated by R. Urbanke, Associate Editor for Coding Techniques. Digital Object Identifier 10.1109/TIT.2004.839491

An unreliable component is a component that is subject to transient faults, i.e., faults that manifest themselves at particular time steps but do not necessarily persist for later times [1]–[3].
In digital circuits, for example, transient faults could be due to noise (such as radiation or electromagnetic interference) and are rapidly becoming a major cause for concern as the device size decreases.¹ A transient state-transition fault in a dynamic system is a transient fault that causes a transition to an incorrect state. Due to the nature of dynamic systems, the effect of a transient state-transition fault may last over several time steps; in fact, state corruption at a particular time step will, in general, lead to the corruption of the overall behavior and output at future time steps.

To appreciate the severity of the problem of protecting dynamic systems against transient state-transition faults, consider the following toy example. Assume that we have a discrete-time dynamic system (e.g., an FSM) in which the probability of making a transition to an incorrect next state (on any input) is $p_s$ and is independent between different time steps. Clearly, the probability that the system follows the correct state trajectory for $L$ consecutive time steps is $(1 - p_s)^L$ and goes to zero exponentially with $L$. A common solution to this problem has been modular redundancy with feedback, as shown in Fig. 1. We use several replicas of the original system, each of which is initialized at the same state and is supplied with the same input (so that all systems ideally follow the same state trajectory). At the end of each time step, the voter decides what the correct state is based on a majority voting rule; the corrected state is then fed back to all systems.² If the voter is fault free, this approach works nicely: given a target probability of failure for any prespecified (finite) time interval, we can always choose an appropriate number of system replicas to achieve our objective (i.e., to ensure that, with high probability, all replicas are in the correct state after the voting stage).
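The toy calculation and the fault-free-voter argument above can be checked numerically. The following Python sketch is illustrative only and is not taken from the paper: the function names are hypothetical, and it assumes that replicas fail independently with the same per-step probability, that the voter never fails, and that a time step succeeds exactly when a strict majority of replicas reach the correct next state.

```python
from math import comb


def unprotected_success(p_s: float, steps: int) -> float:
    """Probability that one unprotected system follows the correct
    state trajectory for `steps` consecutive time steps."""
    return (1.0 - p_s) ** steps


def majority_correct(p_s: float, replicas: int) -> float:
    """Probability that a strict majority of independent replicas
    make the correct transition in a single time step."""
    need = replicas // 2 + 1  # smallest strict majority
    return sum(
        comb(replicas, k) * (1.0 - p_s) ** k * p_s ** (replicas - k)
        for k in range(need, replicas + 1)
    )


def voted_success(p_s: float, replicas: int, steps: int) -> float:
    """Probability that a fault-free majority voter restores the correct
    state at every one of `steps` time steps (assumption: the corrected
    state is fed back to all replicas after each step)."""
    return majority_correct(p_s, replicas) ** steps


def replicas_needed(p_s: float, steps: int, target_failure: float,
                    max_replicas: int = 999):
    """Smallest odd number of replicas whose overall failure probability
    is at most `target_failure`, or None if none is found."""
    for n in range(1, max_replicas + 1, 2):
        if 1.0 - voted_success(p_s, n, steps) <= target_failure:
            return n
    return None


if __name__ == "__main__":
    p_s, steps = 0.01, 1000
    print("single system:", unprotected_success(p_s, steps))
    print("three replicas:", voted_success(p_s, 3, steps))
    print("replicas for 5% failure:", replicas_needed(p_s, steps, 0.05))
```

Under these assumptions, with a per-step fault probability of 0.01 and 1000 time steps, a single system almost surely leaves the correct trajectory, whereas five replicas with a fault-free voter keep the overall failure probability below 5%. The sketch says nothing about the case of an unreliable voter, which is the case of interest in this paper.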
¹ In fact, the primary motivation for this work was the seemingly inherent unreliability that characterizes emerging technologies, such as submicrometer designs, single-electron devices, and molecular electronics [4]–[6].

² There exist many variations of this basic modular redundancy approach. For example, one can implement bit-wise voting, combine the functionality of the state transition mechanism and the voter into one combinational circuit, or use periodic voting (e.g., perform error correction only once every several time steps, as done, for example, in [7]–[9] for different types of dynamic systems). Our goal here is not to discuss the advantages and disadvantages of these different variations but to illustrate the fundamental limitations that we are faced with in the context of dynamic systems.

Furthermore, by increasing the