Runtime Asynchronous Fault Tolerance via Speculation

Yun Zhang, Soumyadeep Ghosh, Jialu Huang, Jae W. Lee, Scott A. Mahlke, David I. August

Princeton University, Princeton, New Jersey, USA
SungKyunKwan University, Suwon, Korea
University of Michigan, Ann Arbor, Michigan, USA

ABSTRACT

Transient faults are emerging as a critical reliability concern in modern microprocessors. Redundant hardware solutions are commonly deployed to detect transient faults, but they are less flexible and cost-effective than software solutions. However, software solutions are rendered impractical because of high performance overheads. To address this problem, this paper presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the fastest transient fault detection technique known to date. Serving as a layer between the application and the underlying platform, RAFT automatically generates two symmetric program instances from a program binary. It detects transient faults in a non-invasive way and exploits high-confidence value speculation to achieve low runtime overhead. Evaluation on a commodity multicore system demonstrates that RAFT delivers a geomean performance overhead of 2.83% on a set of 30 SPEC CPU benchmarks and STAMP benchmarks. Compared with existing transient fault detection techniques, RAFT exhibits the best performance and fault coverage, without requiring any change to the hardware or the software applications.

1. INTRODUCTION

Transient faults, also known as soft errors, are caused by external events such as particle strikes [3, 21, 25, 30]. These faults may lead to program crash or system failure, without leaving any trace. A combination of exponentially growing transistor counts and voltage scaling makes transient faults a critical concern for the semiconductor industry. Oracle America Inc.
acknowledges that clients including America Online (AOL), eBay and Los Alamos National Labs have suffered from system failures due to transient faults [4, 18]. A recent study shows that a BlueGene/L machine with 104 nodes deployed in Lawrence Livermore National Labs experiences soft errors once every four hours [8]. Given that the reliability per bit is estimated to drop 8% per generation of processors [6], it is critical to ensure fast and effective transient fault tolerance on modern and future architectures.

Recently proposed transient fault detection techniques rely on redundant execution in either hardware or software. Specialized redundant hardware is commonly employed to detect transient faults transparently. For example, IBM S/390 [33], Boeing 777 airplanes [41], and HP’s Non-stop [13] all use redundant hardware for fault tolerance. However, these solutions require specialized hardware components and additional verification cost [2, 29, 33]. Moreover, hardware solutions cannot adapt to changes in deployment environment or scope of protection.

Copyright © ACM, 2012. This is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in CGO, 2012, http://doi.acm.org/10.1145/XXXXXX.

Current architectural trends toward multicore microprocessors naturally provide additional resources, making software redundant execution more viable than ever. Existing software proposals [22, 26, 31, 37, 42] typically insert redundant code into a program at compile time or runtime, and check for transient faults at runtime. Among these proposals, compiler-based techniques [26, 31, 37, 42] are only applicable to programs whose source code is available. Separately compiled modules, such as libraries, cannot be protected using compiler-based techniques due to the absence of source code at compile time.
Runtime techniques, such as PLR [31], use dynamic instrumentation to duplicate program execution at the process level and instrument binaries for fault detection. This approach still has high performance overhead due to the cost of dynamic binary instrumentation and barrier synchronizations at every system call.

To address the performance and applicability issues of software fault detection techniques, this paper presents RAFT, a Runtime Asynchronous Fault Tolerance technique that detects transient faults with low overhead. RAFT serves as a light-weight virtual layer between an application and the underlying platform. It takes a program binary as input and duplicates its execution automatically. During execution, it monitors the behavior of both the original and the duplicated program instances at the system call level using a process monitoring utility provided by the operating system. The arguments of system calls from both instances are compared for equality. A mismatch indicates that a transient fault has occurred, and RAFT reports it to the user. Unlike compiler-based techniques that must obtain knowledge of library functions for fault detection, RAFT need only understand the relatively stable and well-defined set of system calls.

The key insight behind RAFT is that redundant execution can be accelerated by speculatively removing data dependences. Whenever possible, RAFT allows the process that first invokes a system call to continue execution with a speculated return value, without executing the call. When the other process invokes the same system call, RAFT compares the arguments of the two invocations to check for transient faults. If the arguments mismatch, RAFT reports a transient fault and stops program execution. If no fault occurred, the system call is executed and its return value is checked against the specu-