Fuse: A Technique to Anticipate Failures due to Degradation in ALUs
Jaume Abella, Xavier Vera, Osman Unsal†, Oguz Ergin‡, Antonio González
Intel Barcelona Research Center, Intel Labs – UPC
{jaumex.abella, xavier.vera, antonio.gonzalez}@intel.com
Abstract
This paper proposes the fuse, a technique to anticipate
failures due to degradation in any ALU (Arithmetic Logic
Unit), and particularly in an adder. The fuse consists of a
replica of the weakest transistor in the adder and the
circuitry required to measure its degradation. By
mimicking the behavior of the replicated transistor the
fuse anticipates the failure shortly before the first failure in
the adder appears, and hence, data corruption and
program crashes can be avoided. Our results show that
the fuse anticipates the failure in more than 99.9% of the
cases after 96.6% of the lifetime, even for pessimistic
random within-die variations.
1. Introduction
As technology evolves, the geometry of transistors and
wires shrinks. However, supply voltage does not scale at
the same pace [23]; this causes transistors and wires to
suffer higher current densities, which also imply higher
temperatures. The increased current density and
temperature translate into higher vulnerability of circuits.
Under these conditions, transistors and wires will degrade
faster and will be more prone to failures (higher failure
rate per device). Furthermore, there will be an increased
number of failures in the chip because of the larger
number of such devices (transistors shrink but the chip
size is expected to remain constant [23]).
The increasing unreliability of processors will make
devices fail frequently during the normal lifetime of the
processor. Moreover, transistor geometry may change
significantly from one chip to another or even within the
chip itself, in such a way that some components are prone
to degrade faster than others. Similarly, dynamic
variations of operating frequency, voltage and
temperature may accelerate degradation significantly for
some blocks. Thus, the lifetime of blocks in a chip is
unpredictable and mechanisms are required to detect
failures before such failures produce crashes or data
corruption.
† Osman Unsal is currently with the Barcelona Supercomputing
Center, Spain (osmal.unsal@bsc.es)
‡ Oguz Ergin is currently with the TOBB University of Economics
and Technology, Ankara, Turkey (oergin@etu.edu.tr)
Such unreliability can be addressed in several ways.
One solution consists of testing the blocks for errors
[11][18] and reconfiguring the system accordingly.
However, testing only avoids future crashes, but it does
not prevent the system from crashing whenever failures
show up for the first time.
Another set of solutions is based on detecting failures
and avoiding data corruption. Memory-like structures,
such as caches and register files, can be protected with
ECC [9], which is useful to detect transient and permanent
errors. ALUs are very likely to cause crashes and data
corruption because most of the instructions use them, and
thus, it is mandatory to protect them. However,
combinational blocks like ALUs cannot use ECC or
parity, and require more expensive techniques like
reexecution or residue code computation, among others.
Reexecution can be performed in a different instance of
the same type of ALU [16][17] or in a special ALU
devoted to error detection [4]. Both solutions are
expensive in terms of performance, extra hardware, or
both. Residue computation [5][12] is an alternative to
detect failures in ALUs, especially in adders. Obtaining
residue codes requires special hardware to compute the
modulo function.
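As an illustration of the residue check mentioned above, the following minimal sketch verifies an addition by comparing the residue of the result with the sum of the operand residues. It is not taken from this paper: the modulus 3, the function names, and the injected fault are assumptions chosen for illustration, and operand width and carry-out handling are ignored.

```python
MODULUS = 3  # assumed check modulus; any modulus that does not divide the error works


def residue(x: int, m: int = MODULUS) -> int:
    """Residue of x modulo m (the 'modulo function' computed by the checker)."""
    return x % m


def checked_add(a: int, b: int, adder=lambda x, y: x + y, m: int = MODULUS):
    """Add a and b with the given adder and check the result with a residue code.

    Returns (result, ok), where ok is False if the residue check fails.
    """
    result = adder(a, b)
    expected = (residue(a, m) + residue(b, m)) % m
    return result, residue(result, m) == expected


def faulty_adder(x: int, y: int) -> int:
    """Hypothetical degraded adder: flips bit 2 of the correct sum."""
    return (x + y) ^ 0b100


print(checked_add(25, 17))                      # correct adder, check passes
print(checked_add(25, 17, adder=faulty_adder))  # faulty adder, check fails
```

The check detects this fault because flipping bit 2 changes the sum by 4, which is not a multiple of 3. An error that is a multiple of the modulus would go undetected.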
This paper proposes the fuse, a new technique to
anticipate failures due to degradation in ALUs at a very
low cost. In particular, we propose the design and
implementation of a fuse to anticipate failures in a sparse
tree adder [14], which is the one used in the Intel®
Pentium® 4. The fuse is built as a replica of the weakest
transistor in the adder and the circuitry required to
measure its degradation. Whenever the fuse does not meet
the delay constraints, it implies that the protected adder is
about to fail, so it can be disabled or its frequency
decreased [19] to prevent data corruption and program
crashes.
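The reaction to a fuse that misses its delay constraint can be sketched as a small decision function. This is only an illustrative model under stated assumptions: the delay values, the `max_cycle_time_ps` limit, and the names `Action` and `react` are hypothetical and do not come from the paper, which implements the fuse in circuitry rather than software.

```python
from enum import Enum


class Action(Enum):
    KEEP_RUNNING = "keep running"
    LOWER_FREQUENCY = "lower frequency"
    DISABLE_ADDER = "disable adder"


def react(fuse_delay_ps: int, cycle_time_ps: int, max_cycle_time_ps: int) -> Action:
    """Choose a reaction from the measured delay of the fuse (hypothetical policy).

    fuse_delay_ps:     measured delay of the replica transistor path
    cycle_time_ps:     current delay constraint (clock period)
    max_cycle_time_ps: assumed slowest clock period the system tolerates
    """
    if fuse_delay_ps <= cycle_time_ps:
        # The fuse still meets its delay constraint: the adder is considered healthy.
        return Action.KEEP_RUNNING
    if fuse_delay_ps <= max_cycle_time_ps:
        # The fuse misses the constraint, but a slower clock would still accommodate it.
        return Action.LOWER_FREQUENCY
    # No tolerated clock period covers the degraded delay: stop using the adder.
    return Action.DISABLE_ADDER


print(react(480, 500, 700))
print(react(520, 500, 700))
print(react(750, 500, 700))
```

Because the fuse replicates the weakest transistor, the assumption is that it crosses the delay constraint before the adder itself produces a wrong result, leaving time for either reaction.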
The fuse is a very efficient solution in terms of
hardware and power. We illustrate how to design and
implement a fuse for a sparse tree adder, although the
same idea can be extended to any other ALU without
requiring any special property of the protected block,
unlike residue computation.
The rest of the paper is organized as follows. Section 2
introduces the main sources of failure affecting
microprocessors. Section 3 presents the fuse, our
technique to anticipate failures in adders. Section 4
presents the evaluation of the fuse. Section 5 reviews
13th IEEE International On-Line Testing Symposium (IOLTS 2007)
0-7695-2918-6/07 $25.00 © 2007