Compiler Optimizations for Fault Tolerance Software Checking *

Jing Yu and María Jesús Garzarán
Department of Computer Science
University of Illinois at Urbana-Champaign
jingyu, garzaran@cs.uiuc.edu

Abstract

Dramatic increases in the number of transistors that can be integrated on a chip will make the hardware more susceptible to radiation-induced transient errors. High-end architectures like the IBM mainframes, HP NonStop, or mission-critical computers are likely to include several hardware-intensive fault tolerance techniques. However, commodity chips, which are cost- and energy-constrained, will need a more flexible and inexpensive technology for error detection. Software approaches can play a major role in this sector of the market because they need little hardware modification and can be tailored to fit different requirements of reliability and performance.

Current software approaches address the problem by replicating the instructions and adding checking instructions to compare the results [1, 2, 3, 4, 5]. These checking instructions account for a significant fraction of the added overhead. In this work we propose a set of compiler optimizations to identify and remove redundant checks from the replicated code. Two checks are considered redundant if they check the same variable. In this case, it is possible to remove the check that appears first during execution, so that an error will be detected when the second check executes. However, determining how long a check can be delayed is tricky. If we delay it too little, there is little room for optimization. If we delay it too much, errors will propagate to undesired places and result in segmentation faults, corrupted memory, wrong execution paths, or undetected errors across checkpoints. We consider that how long error detection can be delayed depends on the recovery mechanism supported by the hardware or the system.
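The redundant-check idea described above can be illustrated with a minimal sketch. It is not the authors' LLVM implementation: the toy instruction format, the opcode names (`def`, `use`, `check`, `sync`), and the function `remove_redundant_checks` are all hypothetical, and treating a redefinition of a variable as ending the window for delaying its check is a conservative assumption made here, not a rule stated in the text.

```python
from typing import List, Tuple

# Toy straight-line IR (hypothetical, not LLVM's): (opcode, operand).
Instr = Tuple[str, str]


def remove_redundant_checks(block: List[Instr]) -> List[Instr]:
    """Drop a check when a later check of the same variable runs before
    the next synchronization point or redefinition of that variable."""
    covered = set()  # variables that a later, kept check will still verify
    kept = []
    for op, arg in reversed(block):
        if op == "sync":
            covered.clear()        # checks must not be delayed past this point
        elif op == "def":
            covered.discard(arg)   # assumption: a later check sees a new value
        elif op == "check":
            if arg in covered:
                continue           # redundant: the later check catches the error
            covered.add(arg)
        kept.append((op, arg))
    kept.reverse()
    return kept


block = [
    ("def", "x"),
    ("check", "x"),  # removable: x is checked again before the sync point
    ("use", "x"),
    ("check", "x"),
    ("sync", ""),    # user-defined synchronization point
    ("check", "x"),
]
optimized = remove_redundant_checks(block)
```

Here the first check of `x` is dropped because the second one runs before the synchronization point, while the check after the synchronization point is kept. Which instructions count as `sync` would depend on the recovery support available.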
As long as checks are not delayed beyond synchronization checkpoints, the system will be able to recover properly. With our techniques the user can define the synchronization checkpoints based on the hardware support for recovery. In this work we evaluate two levels of hardware or system support: memory without support for checkpointing and rollback, where memory is guaranteed not to be corrupted with wrong values, and memory with low-cost support for checkpointing and rollback. We also consider the situation where the register file is protected with parity or ECC, as in Intel Itanium, Sun UltraSPARC, and IBM Power4-6, because software implementations can take advantage of this hardware feature and omit some of the replicated instructions.

We have evaluated our approach using LLVM as our compiler infrastructure and PIN for fault injection. Our experimental results with SPEC benchmarks on a Pentium 4 show that, in the case where memory is guaranteed not to be corrupted, performance improves by an average of 6.2%. With support for checkpointing, performance improves by an average of 14.7%. A software fault-tolerant system that takes advantage of register-safe platforms improves by an average of 16.0%. Fault injection experiments show that our techniques do not decrease fault coverage, although they slightly increase the number of segmentation faults.

* This work was supported in part by the National Science Foundation under the CSR-AES program (Grant No. 0615273).

References

[1] J. Chang, G. A. Reis, and D. I. August. Automatic Instruction-Level Software-Only Recovery. In DSN '06: Proceedings of the International Conference on Dependable Systems and Networks (DSN'06), pages 83–92, Washington, DC, USA, 2006. IEEE Computer Society.
[2] N. Oh, P. Shirvani, and E. J. McCluskey. Error Detection by Duplicated Instructions in Super-scalar Processors. IEEE Transactions on Reliability, 51(1):63–75, March 2002.
[3] G. A. Reis, J. Chang, N. Vachharajani, R.
Rangan, and D. I. August. SWIFT: Software Implemented Fault Tolerance. In Proc. of the International Symposium on Code Generation and Optimization (CGO), 2005.
[4] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S. Mukherjee. Software-controlled fault tolerance. ACM Trans. Archit. Code Optim., 2(4):366–396, 2005.
[5] G. A. Reis, J. Chang, D. I. August, R. Cohn, and S. S. Mukherjee. Configurable Transient Fault Detection via Dynamic Binary Translation. In Proceedings of the 2nd Workshop on Architectural Reliability (WAR), 2006.

16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007) 0-7695-2944-5/07 $25.00 © 2007