Encore: Low-Cost, Fine-Grained Transient Fault Recovery

Shuguang Feng†, Shantanu Gupta†, Amin Ansari†, Scott A. Mahlke†, and David I. August‡
†Advanced Computer Architecture Laboratory, University of Michigan, Ann Arbor, MI
‡Department of Computer Science, Princeton University, Princeton, NJ
{shoe, shangupt, ansary, mahlke}@umich.edu, august@cs.princeton.edu

ABSTRACT
To meet an insatiable consumer demand for greater performance at less power, silicon technology has scaled to unprecedented dimensions. However, the pursuit of faster processors and longer battery life has come at the cost of reliability. Given the rise of processor reliability as a first-order design constraint, there has been growing interest in low-cost, non-intrusive techniques for transient fault detection. Many of these recent proposals have counted on the availability of hardware recovery mechanisms. Although common in aggressive out-of-order cores, hardware support for speculative rollback and recovery is less common in lower-end commodity processors. This paper presents Encore, a software-based fault recovery mechanism tailored for these lower-cost systems that lack native hardware support for speculative rollback recovery. Encore combines program analysis, profile data, and simple code transformations to create statistically idempotent code regions that can recover from faults at very little cost. Using this software-only, compiler-based approach, Encore provides the ability to recover from transient faults without specialized hardware or the costs of traditional, full-system checkpointing solutions. Experimental results show that Encore, with just 14% runtime overhead, can safely recover, on average, from 97% of transient faults when coupled with existing detection schemes.

Categories and Subject Descriptors
B.8.1 [Performance and Reliability]: Reliability, Testing, and Fault Tolerance; D.3.4 [Programming Languages]: Processors—Compilers

General Terms
Algorithms, Design, Reliability

1. INTRODUCTION
Although it is impossible to build a completely reliable system, hardware vendors attempt to target failure rates that are imperceptibly small. With the aggressive technology scaling pursued by industry, many sources of unreliability are emerging in commercial processors. One prominent source, and the focus of this paper, is soft errors. Also known as transient faults, they can be induced by a variety of phenomena such as electrical noise and high-energy particle strikes resulting from cosmic radiation and chip packaging impurities. Additionally, in newly proposed architectures that embrace the principles of stochastic [23] and near-threshold computing [5], they can also be the result of extreme timing speculation and/or frequency and voltage scaling.
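Before turning to conventional recovery techniques, the notion of an idempotent region from the abstract can be made concrete with a minimal, hypothetical C sketch (the function names and the example transformation are illustrative assumptions, not code from Encore): a region that never overwrites its own live-in memory produces the same result no matter how many times it is executed, so a detected fault can be repaired simply by re-executing the region from its entry, whereas an in-place update cannot safely be re-run.

    /* Idempotent region (illustrative): the inputs in src are never
     * overwritten, so re-executing from the region entry after a detected
     * fault simply recomputes the same values in dst. No checkpoint of
     * memory state is needed. */
    void scale_region(const int *src, int *dst, int n, int k)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }

    /* Non-idempotent region (illustrative): the update is in place, so some
     * inputs may already have been overwritten when a fault is detected;
     * naive re-execution would scale those elements twice. Such a region
     * needs extra copies or checkpointed state to be recoverable. */
    void scale_in_place(int *buf, int n, int k)
    {
        for (int i = 0; i < n; i++)
            buf[i] = buf[i] * k;
    }

Encore's actual analysis is more involved and, as the term "statistically idempotent" suggests, probabilistic, but the sketch captures why preserving a region's inputs makes cheap re-execution-based recovery possible.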
Traditionally, architects have designed systems that would take periodic checkpoints of processor and memory state. In the event of a soft error, the system could roll back to an existing, fault-free snapshot and continue execution (rollback recovery). These highly robust fault recovery solutions have historically also relied on some form of modular redundancy to provide the necessary detection capabilities. Available in spatial and temporal variants, modular redundancy generally involved redundant execution (either on separate hardware or in separate software contexts) followed by detailed comparisons that would identify the presence of a fault [1, 27, 21, 19]. However, the resultant overheads of these coupled detection and recovery schemes, a large component of which was the cost of creating checkpoints, usually relegated their use to high-end, enterprise systems [4].

These simple yet elegant techniques, having served the mission-critical server arena for decades, are not practical outside this niche domain. Although reliability cannot be completely ignored in lower-end systems, they are not usually designed to provide the “five-nines” of fault tolerance capable of sending someone safely to the moon. For such budget-conscious designs with less demanding reliability requirements, the overheads associated with these conventional solutions are prohibitively expensive. In fact, this is the same argument made by [8], and to a similar extent by [3]: most commodity systems do not require reliability guarantees but will settle for probabilistic, best-effort fault tolerance.

This insight has sparked recent interest in transient fault detection techniques that maintain low runtime overheads by sacrificing a small degree of reliability, focusing primarily on the bulk of faults that are relatively inexpensive to detect [29, 10, 8]. However, these techniques [29, 8] assume that hardware provides rollback recovery, arguing that such hardware would already exist to support performance speculation. Although this argument may hold for aggressive out-of-order processors, such hardware support is not present in the majority of low-end commodity systems.

With that in mind, we propose Encore, a software-only solution that seeks to provide probabilistic (best-effort) rollback recovery capabilities at minimal cost. Encore was developed to complement emerging probabilistic detection techniques, enabling them