Automatic Instruction-Level Software-Only Recovery
Jonathan Chang George A. Reis David I. August
Departments of Electrical Engineering and Computer Science
Princeton University
Princeton, NJ 08544
{jcone,gareis,august}@princeton.edu
Abstract
As chip densities and clock rates increase, processors are
becoming more susceptible to transient faults that can affect
program correctness. Computer architects have typically ad-
dressed reliability issues by adding redundant hardware, but
these techniques are often too expensive to be used widely.
Software-only reliability techniques have shown promise in
their ability to protect against soft errors without any
hardware overhead. However, existing low-level software-only
fault tolerance techniques have only addressed the problem
of detecting faults, leaving recovery largely unaddressed. In
this paper, we present the concept, implementation, and eval-
uation of automatic, instruction-level, software-only recov-
ery techniques, as well as various specific techniques rep-
resenting different trade-offs between reliability and perfor-
mance. Our evaluation shows that these techniques fulfill the
promises of instruction-level, software-only fault tolerance by
offering a wide range of flexible recovery options.
1 Introduction
In recent decades, microprocessor performance has been
increasing exponentially, due in large part to smaller and
faster transistors enabled by improved fabrication technology.
While such transistors yield performance enhancements, their
lower threshold voltages and tighter noise margins make them
less reliable [3, 14, 26], rendering processors that use them
more susceptible to transient faults. Transient faults are in-
termittent faults caused by external events, such as energetic
particles striking the chip, that do not cause permanent dam-
age, but may result in incorrect program execution by altering
signal transfers or stored values.
When cost is not an issue, system designers typically
address transient faults by relying on large amounts of re-
dundant hardware [8, 27, 32, 33]. While effective, this re-
dundancy is prohibitively expensive for arenas outside of
the high-end, high-availability market, rendering these tech-
niques impractical for the desktop and embedded computing
markets. For example, protecting the register file with ECC
has been shown to be extremely costly in terms of both
performance [28] and power [19].
To provide protection when hardware costs are prohibitive,
software-only approaches have been proposed as alterna-
tives [15, 17, 25, 29]. In particular, techniques such as
SWIFT [23] have demonstrated that high reliability can be
achieved through a software-only fault-detection solution
which degrades performance modestly. These software-only
reliability techniques are valuable because they do not require
any hardware support. They can be applied to future designs
without any hardware changes or even to currently deployed
systems. Software-only approaches also allow for software
control: the user, the application, or the system may
dynamically reconfigure the trade-off between reliability and
performance after the system has been deployed to best suit
varying conditions.
However, detecting faults is only part of the path to full
fault tolerance. In order to truly be reliable, a system must
also be able to recover from faults. Until now, all proposed
low-level software-only techniques of which we are aware
have addressed only fault detection, not fault recovery.
Although detection alone prevents faults from corrupting data,
it does not allow the application to run correctly to
completion in the presence of a fault.
In this paper, we present three novel, software-only re-
covery techniques at the compiler level which offer varying
levels of protection. The first is SWIFT-R, which is based
on SWIFT [23], an existing software-only detection scheme.
The SWIFT-R technique intertwines three copies of a program
and adds majority voting before critical instructions,
offering near-perfect reliability for those applications that
require it.
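
As a rough illustration of the triplication-and-voting idea, the
following Python sketch computes a value three times and votes before
a store. It is not the SWIFT-R compiler transformation itself, which
operates on instructions and registers. The function names, the
dictionary standing in for memory, the choice of a store as the
"critical instruction", and the `fault` parameter used to simulate a
bit flip are all assumptions made for this example.

```python
def majority_vote(a, b, c):
    """Return the value that at least two of the three copies agree on.

    Assumes at most one copy is corrupted (a single transient fault).
    """
    if a == b or a == c:
        return a
    # a disagrees with both b and c, so under the single-fault
    # assumption b and c hold the correct value.
    return b


def redundant_add_then_store(memory, addr, x, y, fault=None):
    """Compute x + y three times, vote, then perform the store.

    `fault` is a hypothetical (copy_index, bit) pair used only to
    simulate a bit flip in one redundant copy.
    """
    copies = [x + y, x + y, x + y]
    if fault is not None:
        copy_index, bit = fault
        copies[copy_index] ^= 1 << bit  # inject a single bit flip
    voted = majority_vote(*copies)
    memory[addr] = voted  # the "critical instruction" in this sketch
    return voted


memory = {}
# A flipped bit in the second copy is outvoted by the other two.
result = redundant_add_then_store(memory, 0x10, 2, 3, fault=(1, 4))
print(result, memory[0x10])  # prints "5 5"
```

Voting only before the store, rather than after every operation,
reflects the idea of placing checks before critical instructions. The
sketch cannot recover if two copies are corrupted at once.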
The second technique we present is TRUMP (Triple
Redundancy Using Multiplication Protection), which
intertwines the original program with an AN-encoded version
of the program. Section 4.1 gives an overview of
AN-encoding, a more efficient representation of redundant
information than simple triplication. At certain points in the
program, the original and AN-encoded versions are compared,
and recovery code is triggered if a mismatch is detected. The
AN-encoding of TRUMP allows recovery even though only two
versions of the program are computed. TRUMP’s AN-encoding
is not as general as SWIFT-R’s triple-modular redundancy,
which renders it unable to protect certain portions of
programs. However, TRUMP’s redundant computation is much
less onerous, providing an alternative for applications that
cannot afford the performance penalty of SWIFT-R but could
benefit from moderate protection.
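
To make the recovery argument concrete, here is a small Python sketch
of how two copies can suffice. It assumes an AN-code with A = 3, a
single bit flip in exactly one of the two copies, and non-negative
integers. The names `encode`, `add_pair`, and `check_and_recover` are
hypothetical and not taken from the paper. The sketch relies on the
fact that a single bit flip changes a value by plus or minus a power
of two, which is never a multiple of 3, so a corrupted encoded copy
cannot remain a valid code word.

```python
A = 3  # assumed AN-code constant for this sketch


def encode(n):
    """Return the AN-encoded form of n (simply A * n)."""
    return A * n


def add_pair(x, x_an, y, y_an):
    """Add in both domains; AN-codes are preserved by addition."""
    return x + y, x_an + y_an


def check_and_recover(value, value_an):
    """Compare the two copies and repair whichever one is faulty.

    Assumes at most one copy was hit by a single bit flip.
    """
    if A * value == value_an:
        return value, value_an          # copies agree, nothing to do
    if value_an % A == 0:
        # The encoded copy is still a valid code word, so the
        # original copy must be the corrupted one.
        return value_an // A, value_an
    # The encoded copy is not a multiple of A, so it is corrupted.
    return value, A * value


s, s_an = add_pair(2, encode(2), 5, encode(5))   # s = 7, s_an = 21
print(check_and_recover(s ^ (1 << 3), s_an))     # prints "(7, 21)"
print(check_and_recover(s, s_an ^ (1 << 2)))     # prints "(7, 21)"
```

Operations that do not preserve the multiple-of-A property, such as
bitwise logic, are not covered by this sketch. That limitation is in
line with the remark above that AN-encoding is less general than
triplication.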
The last technique, MASK, dynamically enforces invari-
ants that can be proved true statically. By merely assert-
ing statically known facts at various points in the program,
MASK is able to improve the reliability of the system with-
out adding redundancy. The MASK technique is more
lightweight than the other two techniques but can still sub-
stantially increase reliability in some cases.
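
One way to picture invariant enforcement is a value that a compiler
can prove fits in eight bits, for example one produced by a
zero-extended byte load. The Python sketch below assumes that
particular invariant. The helper name `enforce_byte_invariant` and the
histogram use case are hypothetical illustrations, not the paper's
implementation.

```python
BYTE_MASK = 0xFF  # assumed invariant: only the low 8 bits may be set


def enforce_byte_invariant(value):
    """Clear every bit that is statically known to be zero.

    A bit flip in the upper bits is silently masked away. A flip in
    the low 8 bits is not detected or corrected by this technique.
    """
    return value & BYTE_MASK


def count_byte(histogram, byte_value):
    """Use the value as an index only after re-asserting the invariant."""
    byte_value = enforce_byte_invariant(byte_value)
    histogram[byte_value] += 1


histogram = [0] * 256
corrupted = 0x41 ^ (1 << 20)       # simulate a flip in an upper bit
count_byte(histogram, corrupted)   # still increments histogram[0x41]
print(histogram[0x41])             # prints "1"
```

No redundant copy of the value is kept. The only added work is the
masking operation, which is why this style of protection is cheap but
covers only faults that violate the known invariant.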
We implemented SWIFT-R, TRUMP, and MASK in a
Proceedings of the 2006 International Conference on Dependable Systems and Networks (DSN’06)
0-7695-2607-1/06 $20.00 © 2006 IEEE