210 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 1, JANUARY 2005
Coding Approaches to Fault Tolerance in
Linear Dynamic Systems
Christoforos N. Hadjicostis, Member, IEEE, and George C. Verghese, Fellow, IEEE
Abstract—This paper discusses fault tolerance in discrete-time
dynamic systems, such as finite-state controllers or computer
simulations, with focus on the use of coding techniques to
efficiently provide fault tolerance to linear finite-state machines
(LFSMs). Unlike traditional fault tolerance schemes, which
rely heavily (particularly for dynamic systems operating over
extended time horizons) on the assumption that the
error-correcting mechanism is fault free, we are interested in the case when
all components of the implementation are fault prone. The paper
starts with a paradigmatic fault tolerance scheme that
systematically adds redundancy into a discrete-time dynamic system in a
way that achieves tolerance to transient faults in both the state
transition and the error-correcting mechanisms. By combining
this methodology with low-complexity error-correcting coding,
we then obtain an efficient way of providing fault tolerance to
$k$ identical unreliable LFSMs that operate in parallel on distinct
input sequences. The overall construction requires only a constant
amount of redundant hardware per machine (but sufficiently large
$k$) to achieve an arbitrarily small probability of overall failure
for any prespecified (finite) time interval, leading in this way to a
lower bound on the computational capacity of unreliable LFSMs.
Index Terms—Fault tolerance, linear dynamic systems, linear
finite-state machines (LFSMs), transient faults, unreliable error
correction.
I. INTRODUCTION AND TERMINOLOGY
A discrete-time dynamic system evolves in time
according to an internal state that influences its output
and future behavior. Examples of dynamic systems include
finite-state machines (FSMs), digital filters, convolutional
encoders, decoders, and algorithms or simulations running on a
computer architecture over several time steps. In this paper, we
are interested in building reliable dynamic systems exclusively
out of unreliable components, including components in any
error-correcting mechanisms. We initially explore a general
methodology for protecting an arbitrary discrete-time dynamic
system against transient faults in its implementation; once this
paradigm is analyzed, we combine it with coding techniques to
build reliable linear finite-state machines (LFSMs) out of
unreliable components (namely, unreliable XOR gates and voters).
The end result is an efficient fault-tolerance scheme that uses
a constant amount of redundant hardware per machine and is
able to protect multiple unreliable LFSMs that have identical
dynamics but operate on distinct input sequences.
Manuscript received December 13, 2001; revised September 14, 2004. This
work was supported in part by the National Science Foundation under NSF
CAREER Award 0092696, in part by the Air Force Office of Scientific
Research under Award AFOSR DoD F49620-01-1-0365URI, and in part by
fellowships from the National Semiconductor Corporation and the Grass
Instrument Company.
C. N. Hadjicostis is with the Department of Electrical and Computer
Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801-2307
USA.
G. C. Verghese is with the Department of Electrical Engineering and
Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139
USA.
Communicated by R. Urbanke, Associate Editor for Coding Techniques.
Digital Object Identifier 10.1109/TIT.2004.839491
An unreliable component is a component that is subject to
transient faults, i.e., faults that manifest themselves at particular
time steps but do not necessarily persist for later times [1]–[3].
In digital circuits, for example, transient faults could be due to
noise (such as radiation or electromagnetic interference) and
are rapidly becoming a major cause for concern as the device
size decreases.¹ A transient state-transition fault in a dynamic
system is a transient fault that causes a transition to an
incorrect state. Due to the nature of dynamic systems, the effect of a
transient state-transition fault may last over several time steps;
in fact, state corruption at a particular time step will, in general,
lead to the corruption of the overall behavior and output at
future time steps.
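As a rough illustration of how a single transient state-transition fault keeps affecting later time steps, the following Python sketch simulates a small LFSM. The state update $q[t+1] = A q[t] \oplus b x[t]$ over GF(2) is the usual LFSM form and is assumed here, as are the particular 3-bit matrix `A`, the vector `b`, the random input sequence, and the helper names `lfsm_step` and `run`; none of them are taken from the paper.

```python
import random


def lfsm_step(A, b, state, x):
    """One step of q[t+1] = A q[t] XOR b x[t], with all arithmetic over GF(2)."""
    n = len(state)
    return [
        (sum(A[i][j] & state[j] for j in range(n)) + (b[i] & x)) % 2
        for i in range(n)
    ]


def run(A, b, q0, inputs, fault_step=None, fault_bit=0):
    """Return the state trajectory; optionally flip one state bit at one step."""
    state = list(q0)
    trajectory = []
    for t, x in enumerate(inputs):
        state = lfsm_step(A, b, state, x)
        if t == fault_step:
            state[fault_bit] ^= 1  # transient fault: a single bit flip, once
        trajectory.append(tuple(state))
    return trajectory


# Hypothetical 3-bit machine (illustrative values only).
A = [[0, 1, 0],
     [0, 0, 1],
     [1, 1, 0]]
b = [0, 0, 1]

rng = random.Random(0)
inputs = [rng.randint(0, 1) for _ in range(20)]

clean = run(A, b, [0, 0, 0], inputs)
faulty = run(A, b, [0, 0, 0], inputs, fault_step=5, fault_bit=0)

# Time steps at which the faulty machine is in the wrong state.
corrupted_steps = [t for t in range(len(inputs)) if clean[t] != faulty[t]]
print(corrupted_steps)
```

Because the assumed `A` is invertible over GF(2), the error pattern $e[t+1] = A e[t]$ never returns to zero, so in this sketch the one-time fault leaves the machine in an incorrect state at every later step, regardless of the inputs.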
To appreciate the severity of the problem of protecting dynamic
systems against transient state-transition faults, consider the
following toy example. Assume that we have a discrete-time
dynamic system (e.g., an FSM) in which the probability of making
a transition to an incorrect next state (on any input) is $p_s$ and
is independent between different time steps. Clearly, the
probability that the system follows the correct state trajectory for
$L$ consecutive time steps is $(1-p_s)^L$ and goes to zero
exponentially with $L$. A common solution to this problem has been
modular redundancy with feedback as shown in Fig. 1. We use
several replicas of the original system, each of which is
initialized at the same state and is supplied with the same input (so
that all systems ideally follow the same state trajectory). At the
end of each time step, the voter decides what the correct state
is based on a majority voting rule; the corrected state is then
fed back to all systems.² If the voter is fault free, this approach
works nicely: given a target probability of failure for any
prespecified (finite) time interval, we can always choose an
appropriate number of system replicas to achieve our objective (i.e.,
to ensure that, with high probability, all replicas are in the
correct state after the voting stage). Furthermore, by increasing the
¹ In fact, the primary motivation for this work was the seemingly inherent
unreliability that characterizes emerging technologies, such as submicrometer
designs, single-electron devices, and molecular electronics [4]–[6].
² There exist many variations of this basic modular redundancy approach. For
example, one can implement bit-wise voting, combine the functionality of the
state transition mechanism and the voter into one combinational circuit, or use
periodic voting (e.g., perform error correction only once every several time
steps, as done for example in [7]–[9] for different types of dynamic systems). Our
goal here is not to discuss the advantages and disadvantages of these different
variations but to illustrate the fundamental limitations that we are faced with in
the context of dynamic systems.
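The toy example and the modular redundancy scheme with a fault-free voter can be checked numerically with a short Python sketch. Everything named here is an assumption made for illustration: `p` is the per-step probability of an incorrect transition, `L` is the number of time steps, `n` is an odd number of replicas whose faults are taken to be independent, and a time step is (pessimistically) counted as failed whenever the correct replicas do not form a strict majority. The voter itself is assumed fault free, which is exactly the assumption the paper goes on to relax.

```python
from math import comb


def p_correct_unprotected(p, L):
    """Probability that a single system stays on the correct trajectory for L steps."""
    return (1.0 - p) ** L


def p_vote_failure(p, n):
    """Probability that at least a majority of n (odd) replicas err in one step."""
    need = n // 2 + 1  # number of faulty replicas that outvotes the correct ones
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(need, n + 1))


def p_correct_with_voting(p, L, n):
    """Probability that every one of L voting rounds has a correct majority."""
    return (1.0 - p_vote_failure(p, n)) ** L


def replicas_needed(p, L, target_failure, max_n=101):
    """Smallest odd n meeting the target failure probability, or None if not found."""
    for n in range(1, max_n + 1, 2):
        if 1.0 - p_correct_with_voting(p, L, n) <= target_failure:
            return n
    return None


if __name__ == "__main__":
    p, L = 0.01, 1000
    print("unprotected:", p_correct_unprotected(p, L))
    print("3 replicas :", p_correct_with_voting(p, L, 3))
    print("replicas for failure <= 1e-3:", replicas_needed(p, L, 1e-3))
```

With a fault-free voter the per-step failure probability drops rapidly as `n` grows, so a finite number of replicas suffices for any finite horizon `L`; the sketch says nothing about what happens once the voter can also fail.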
0018-9448/$20.00 © 2005 IEEE