Fuse: A Technique to Anticipate Failures due to Degradation in ALUs

Jaume Abella, Xavier Vera, Osman Unsal †, Oguz Ergin ‡, Antonio González
Intel Barcelona Research Center, Intel Labs – UPC
{jaumex.abella, xavier.vera, antonio.gonzalez}@intel.com

Abstract

This paper proposes the fuse, a technique to anticipate failures due to degradation in any ALU (Arithmetic Logic Unit), and particularly in an adder. The fuse consists of a replica of the weakest transistor in the adder and the circuitry required to measure its degradation. By mimicking the behavior of the replicated transistor, the fuse anticipates the failure shortly before the first failure in the adder appears, so data corruption and program crashes can be avoided. Our results show that the fuse anticipates the failure in more than 99.9% of the cases after 96.6% of the lifetime, even for pessimistic random within-die variations.

1. Introduction

As technology evolves, the geometry of transistors and wires shrinks. However, supply voltage does not scale at the same pace [23]; this causes transistors and wires to suffer higher current densities, which also imply higher temperatures. The increased current density and temperature translate into higher vulnerability of circuits. Under these conditions, transistors and wires will degrade faster and will be more prone to failures (higher failure rate per device). Furthermore, there will be an increased number of failures in the chip because of the larger number of such devices (transistors shrink but the chip size is expected to remain constant [23]).

The increasing unreliability of processors will make devices fail frequently during the normal lifetime of the processor. Moreover, transistor geometry may change significantly from one chip to another or even within the chip itself, in such a way that some components are prone to degrade faster than others.
Similarly, dynamic variations of operating frequency, voltage and temperature may accelerate degradation significantly for some blocks. Thus, the lifetime of blocks in a chip is unpredictable, and mechanisms are required to detect failures before they produce crashes or data corruption.

† Osman Unsal is currently with the Barcelona Supercomputing Center, Spain (osmal.unsal@bsc.es)
‡ Oguz Ergin is currently with the TOBB University of Economics and Technology, Ankara, Turkey (oergin@etu.edu.tr)

Such unreliability can be addressed in several ways. One solution consists of testing the blocks for errors [11][18] and reconfiguring the system accordingly. However, testing only avoids future crashes; it does not prevent the system from crashing when failures show up for the first time.

Another set of solutions is based on detecting failures and avoiding data corruption. Memory-like structures, such as caches and register files, can be protected with ECC [9], which is useful to detect transient and permanent errors. ALUs are very likely to cause crashes and data corruption because most instructions use them, and thus it is mandatory to protect them. However, combinational blocks like ALUs cannot use ECC or parity, and require more expensive techniques such as reexecution or residue code computation, among others. Reexecution can be performed in a different instance of the same type of ALU [16][17] or in a special ALU devoted to error detection [4]. Both solutions are expensive in terms of performance, extra hardware, or both. Residue computation [5][12] is an alternative to detect failures in ALUs, especially in adders. Obtaining residue codes requires special hardware to compute the modulo function.

This paper proposes the fuse, a new technique to anticipate failures due to degradation in ALUs at a very low cost.
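For readers unfamiliar with the residue-based alternative mentioned above, the following is a minimal software sketch of a residue check on an adder. It is illustrative only and is not taken from the cited works: the function names, the choice of modulus 3, and the injected fault are all assumptions, and operand width and carry-out are ignored.

```python
def residue_checked_add(a, b, modulus=3, adder=lambda x, y: x + y):
    """Add a and b with `adder` and check the result against its residue.

    Relies on the identity (a + b) mod m == ((a mod m) + (b mod m)) mod m.
    The names and the default modulus are hypothetical, chosen for illustration.
    """
    result = adder(a, b)
    # Residue predicted from the operands alone (the "checker" path).
    expected = ((a % modulus) + (b % modulus)) % modulus
    # The check passes only if the adder output has the predicted residue.
    return result, (result % modulus) == expected


def faulty_adder(x, y):
    """Hypothetical degraded adder: bit 2 of the sum is flipped."""
    return (x + y) ^ 0b100


print(residue_checked_add(25, 17))                      # healthy adder
print(residue_checked_add(25, 17, adder=faulty_adder))  # faulty adder
```

With modulus 3, any single-bit flip in the sum changes the result by a power of two, which is never a multiple of 3, so the check fails for such faults. The modulo computations in the checker path correspond to the special hardware that residue schemes require.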
In particular, we propose the design and implementation of a fuse to anticipate failures in a sparse tree adder [14], which is the one used in the Intel® Pentium® 4. The fuse is built as a replica of the weakest transistor in the adder and the circuitry required to measure its degradation. When the fuse no longer meets its delay constraints, the protected adder is about to fail, so the adder can be disabled or its frequency decreased [19] to prevent data corruption and program crashes. The fuse is a very efficient solution in terms of hardware and power. We illustrate how to design and implement a fuse for a sparse tree adder, although the same idea can be extended to any other ALU and, unlike residue computation, it does not require any special property of the protected block.

13th IEEE International On-Line Testing Symposium (IOLTS 2007) 0-7695-2918-6/07 $25.00 © 2007

The rest of the paper is organized as follows. Section 2 introduces the main sources of failure affecting microprocessors. Section 3 presents the fuse, our technique to anticipate failures in adders. Section 4 presents the evaluation of the fuse. Section 5 reviews