Automatic Instruction-Level Software-Only Recovery

Jonathan Chang, George A. Reis, David I. August
Departments of Electrical Engineering and Computer Science
Princeton University
Princeton, NJ 08544
{jcone,gareis,august}@princeton.edu

Abstract

As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. Computer architects have typically addressed reliability issues by adding redundant hardware, but these techniques are often too expensive to be used widely. Software-only reliability techniques have shown promise in their ability to protect against soft-errors without any hardware overhead. However, existing low-level software-only fault tolerance techniques have only addressed the problem of detecting faults, leaving recovery largely unaddressed. In this paper, we present the concept, implementation, and evaluation of automatic, instruction-level, software-only recovery techniques, as well as various specific techniques representing different trade-offs between reliability and performance. Our evaluation shows that these techniques fulfill the promises of instruction-level, software-only fault tolerance by offering a wide range of flexible recovery options.

1 Introduction

In recent decades, microprocessor performance has been increasing exponentially, due in large part to smaller and faster transistors enabled by improved fabrication technology. While such transistors yield performance enhancements, their lower threshold voltages and tighter noise margins make them less reliable [3, 14, 26], rendering processors that use them more susceptible to transient faults. Transient faults are intermittent faults caused by external events, such as energetic particles striking the chip, that do not cause permanent damage, but may result in incorrect program execution by altering signal transfers or stored values.
When cost is not an issue, system designers typically address transient faults by relying on large amounts of redundant hardware [8, 27, 32, 33]. While effective, this redundancy is prohibitively expensive for arenas outside of the high-end, high-availability market, rendering these techniques impractical for the desktop and embedded computing markets. For example, protecting the register file with ECC has been shown to be extremely costly in terms of both performance [28] and power [19].

To provide protection when hardware costs are prohibitive, software-only approaches have been proposed as alternatives [15, 17, 25, 29]. In particular, techniques such as SWIFT [23] have demonstrated that high reliability can be achieved through a software-only fault-detection solution which degrades performance modestly. These software-only reliability techniques are valuable because they do not require any hardware support. They can be applied to future designs without any hardware changes or even to currently deployed systems. Software-only approaches also allow for software control; the user, the application, or the system may dynamically reconfigure the trade-off between reliability and performance after the system has been deployed to best suit varying conditions.

However, detecting faults is only part of the path to full fault tolerance. In order to truly be reliable, a system must also be able to recover from faults. Until now, all proposed low-level software-only techniques of which we are aware have addressed only fault detection, not fault recovery. Although this prevents faults from corrupting data, it does not allow the application to correctly run to completion in the presence of a fault.

In this paper, we present three novel, software-only recovery techniques at the compiler level which offer varying levels of protection. The first is SWIFT-R, which is based on SWIFT [23], an existing software-only detection scheme.
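The distinction between detection and recovery drawn above can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the SWIFT or SWIFT-R implementation: the names `flip_bit`, `checked_dup`, `vote`, and `FaultDetected` are hypothetical, a transient fault is modeled as a single bit flip in one stored value, and at most one copy is assumed to be corrupted. With two copies a mismatch can only be reported; with three copies, as in the majority voting that the description of SWIFT-R relies on, the corrupted copy can be outvoted and execution can continue.

```python
class FaultDetected(Exception):
    """Raised when redundant copies disagree and no repair is possible."""


def flip_bit(value: int, bit: int) -> int:
    # Hypothetical fault model: a transient fault flips one bit of a value.
    return value ^ (1 << bit)


def checked_dup(a: int, b: int) -> int:
    # Detection only: two copies can reveal a mismatch, but nothing says
    # which copy is the correct one, so the program has to stop.
    if a != b:
        raise FaultDetected("copies differ; correct value is unknown")
    return a


def vote(a: int, b: int, c: int) -> int:
    # Recovery: assuming at most one of the three copies is corrupted,
    # any value that appears at least twice is the correct one.
    if a == b or a == c:
        return a
    return b  # a is the odd one out, so b and c agree


original = 42
corrupted = flip_bit(original, 3)  # one copy is hit by a fault

try:
    checked_dup(corrupted, original)
    detected = False
except FaultDetected:
    detected = True  # the fault is caught, but the run cannot complete

recovered = vote(corrupted, original, original)  # the fault is outvoted
```

In this sketch the duplicated check stops the program, whereas the three-way vote returns the original value and lets the computation proceed, which is the gap between detection and recovery that the paper addresses.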
The SWIFT-R technique intertwines three copies of a program and adds majority voting before critical instructions, offering near-perfect reliability for those applications that require it. The second technique we present is TRUMP (Triple Redundancy Using Multiplication Protection), which intertwines the original program with an AN-encoded version of the program. Section 4.1 will give an overview of AN-encoding, a more efficient representation of redundant information than simple triplication. At certain points in the program, the original and AN-encoded versions are compared and recovery code is triggered if a mismatch is detected. The AN-encoding of TRUMP allows recovery although only two versions of the program are computed. Although TRUMP’s AN-encoding is not as general as SWIFT-R’s triple-modular redundancy, rendering it unable to protect certain portions of programs, TRUMP’s redundant computation is much less onerous, providing an alternative for applications that cannot afford the performance penalty of SWIFT-R, but could benefit from moderate protection.

The last technique, MASK, dynamically enforces invariants that can be proved true statically. By merely asserting statically known facts at various points in the program, MASK is able to improve the reliability of the system without adding redundancy. The MASK technique is more lightweight than the other two techniques but can still substantially increase reliability in some cases.

We implemented SWIFT-R, TRUMP, and MASK in a

Proceedings of the 2006 International Conference on Dependable Systems and Networks (DSN’06)
0-7695-2607-1/06 $20.00 © 2006 IEEE