Compiler Optimizations for Fault Tolerance Software Checking*
Jing Yu and María Jesús Garzarán
Department of Computer Science
University of Illinois at Urbana-Champaign
{jingyu, garzaran}@cs.uiuc.edu
Abstract
Dramatic increases in the number of transistors that can be
integrated on a chip will make hardware more susceptible to
radiation-induced transient errors. High-end architectures such as
IBM mainframes, HP NonStop, or mission-critical computers are likely
to include several hardware-intensive fault tolerance techniques.
However, commodity chips, which are cost- and energy-constrained,
will need a more flexible and inexpensive technology for error
detection. Software approaches can play a major role in this sector
of the market because they need little hardware modification and can
be tailored to fit different requirements of reliability and
performance.
Current software approaches address the problem by replicating the
instructions and adding checking instructions to compare the results
[1, 2, 3, 4, 5]. These checking instructions account for a
significant fraction of the added overhead. In this work we propose a
set of compiler optimizations to identify and remove redundant checks
from the replicated code. Two checks are considered redundant if they
check the same variable. In this case, it is possible to remove the
check that appears first during execution, so that an error will be
detected when the second check executes. However, determining how
long a check can be delayed is tricky. If we delay it too little,
there is little room for optimization. If we delay it too much,
errors will propagate to undesired places and result in segmentation
faults, corrupted memory, wrong execution paths, or undetected errors
across checkpoints. We argue that how long error detection can be
delayed depends on the recovery mechanism supported by the hardware
or the system. As long as checks are not delayed beyond
synchronization checkpoints, the system will be able to recover
properly.
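The idea of removing the earlier of two checks on the same variable,
without delaying detection past a synchronization checkpoint, can be
illustrated with a toy sketch. This is not the authors' algorithm or
their LLVM implementation. It assumes a straight-line instruction
list, and the instruction kinds ("def", "check", "store"), the
function name remove_redundant_checks, and the is_sync_point
predicate are all hypothetical names introduced for illustration. The
sketch also assumes that redefining a variable ends the redundancy,
which the abstract does not state.

```python
from typing import Callable, List, Tuple

# Toy instruction: (kind, variable). Kinds are hypothetical:
#   "def"   v -> v (and its replica) is recomputed
#   "check" v -> compare v against its replica
#   "store" v -> write v to memory
Instr = Tuple[str, str]


def remove_redundant_checks(
    code: List[Instr],
    is_sync_point: Callable[[Instr], bool],
) -> List[Instr]:
    """Drop a check when a later check of the same, unmodified variable
    executes before the next synchronization point."""
    covered = set()  # variables checked later within the current region
    kept_reversed = []
    for instr in reversed(code):  # scan backward from the end
        kind, var = instr
        if is_sync_point(instr):
            # Earlier checks must not be delayed past this point.
            covered.clear()
        elif kind == "check":
            if var in covered:
                continue  # a later check will detect the same error
            covered.add(var)
        elif kind == "def":
            # Later checks see a new value, so they do not cover
            # checks of the old one (an assumption of this sketch).
            covered.discard(var)
        kept_reversed.append(instr)
    return kept_reversed[::-1]


code = [
    ("def", "a"), ("check", "a"), ("def", "b"), ("check", "a"),
    ("check", "b"), ("store", "b"), ("check", "a"),
    ("def", "a"), ("check", "a"),
]

# Stores act as synchronization points: only the first check of "a" goes.
strict = remove_redundant_checks(code, lambda i: i[0] == "store")

# No synchronization points: checks may be delayed further.
relaxed = remove_redundant_checks(code, lambda i: False)
```

Passing the synchronization points as a predicate mirrors the
statement that they are user-defined: with stores treated as
synchronization points only one check is removed here, while with no
synchronization points two checks are removed.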
With our techniques, the user can define the synchronization
checkpoints based on the hardware support for recovery. In this work
we evaluate two levels of hardware or system support: memory without
support for checkpointing and rollback, where memory is guaranteed
not to be corrupted with wrong values, and memory with low-cost
support for checkpointing and rollback. We also consider the case
where the register file is protected with parity or ECC, as in Intel
Itanium, Sun UltraSPARC, and IBM Power4-6, because software
implementations can take advantage of this hardware feature to
remove some of the replicated instructions.
* This work was supported in part by the National Science Foundation
under the CSR-AES program (Grant No. 0615273).
We have evaluated our approach using LLVM as our compiler
infrastructure and PIN for fault injection. Our experimental results
with SPEC benchmarks on a Pentium 4 show that, when memory is
guaranteed not to be corrupted, performance improves by an average of
6.2%. With additional support for checkpointing, performance improves
by an average of 14.7%. A software fault-tolerant system that takes
advantage of register-safe platforms improves by an average of 16.0%.
Fault injection experiments show that our techniques do not decrease
fault coverage, although they slightly increase the number of
segmentation faults.
References
[1] J. Chang, G. A. Reis, and D. I. August. Automatic Instruction-Level
Software-Only Recovery. In Proceedings of the International Conference
on Dependable Systems and Networks (DSN '06), pages 83–92, Washington,
DC, USA, 2006. IEEE Computer Society.
[2] N. Oh, P. Shirvani, and E. J. McCluskey. Error Detection by
Duplicated Instructions in Super-scalar Processors. IEEE Transactions
on Reliability, 51(1):63–75, March 2002.
[3] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August.
SWIFT: Software Implemented Fault Tolerance. In Proc. of the
International Symposium on Code Generation and Optimization (CGO),
2005.
[4] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and
S. S. Mukherjee. Software-controlled fault tolerance. ACM Trans.
Archit. Code Optim., 2(4):366–396, 2005.
[5] G. A. Reis, J. Chang, D. I. August, R. Cohn, and S. S. Mukherjee.
Configurable Transient Fault Detection via Dynamic Binary Translation.
In Proceedings of the 2nd Workshop on Architectural Reliability (WAR),
2006.
16th International Conference on
Parallel Architecture and Compilation Techniques (PACT 2007)
0-7695-2944-5/07 $25.00 © 2007