OVERHEAD AND RELIABILITY ANALYSIS OF ALGORITHM-BASED FAULT TOLERANCE IN FPGA SYSTEMS

Adam Jacobs, Grzegorz Cieslewski, and Alan D. George
NSF Center for High-Performance Reconfigurable Computing (CHREC)
Department of Electrical and Computer Engineering, University of Florida
email: {jacobs, cieslewski, george}@chrec.org

ABSTRACT

Commercial SRAM-based, field-programmable gate arrays (FPGAs) have the capability to provide space applications with the necessary performance, energy-efficiency, and adaptability to meet next-generation mission requirements. However, mitigating an FPGA’s susceptibility to radiation-induced faults is challenging. Triple-modular redundancy (TMR) techniques are traditionally used to mitigate radiation effects, but TMR incurs substantial overheads such as increased area and power requirements. In order to reduce these overheads while still providing sufficient radiation mitigation, we propose the use of algorithm-based fault tolerance (ABFT). We investigate the effectiveness of hardware-based ABFT logic in COTS FPGAs by developing multiple ABFT-enabled matrix multiplication designs, carefully analyzing resource usage and reliability tradeoffs, and proposing design modifications for higher reliability. We perform fault-injection testing on a Xilinx Virtex-5 platform to validate these ABFT designs, measure design vulnerability, and compare ABFT effectiveness to other fault-tolerance methods. Our hybrid ABFT design reduces total design vulnerability by 99% while only incurring 25% overhead over a baseline, non-protected design.

1. INTRODUCTION

As remote sensor technology for space systems increases in fidelity, the amount of data collected by orbiting satellites and other space vehicles will continue to outpace the ability to transmit that data to other stations (e.g., ground stations, other satellites).
By increasing the onboard data-processing capabilities of future systems, raw data can be interpreted, reduced, and/or compressed onboard the space system before transmitting the results to ground stations, thus reducing data transmission requirements. Traditionally, radiation-hardened devices, with increased protection from long-term radiation exposure (total ionizing dose), provide system reliability and correctness for these space systems. However, these hardened devices are very expensive and have dramatically reduced performance as compared to non-hardened commercial-off-the-shelf (COTS) components.

This work was supported in part by the I/UCRC Program of the National Science Foundation under Grant No. EEC-0642422.

One approach for high-performance space system design leverages hardware-adaptive devices such as field-programmable gate arrays (FPGAs). COTS FPGAs can provide parallel computations at a high level of performance per unit size, mass, and power [1]. Fortunately, many space applications, such as synthetic aperture radar (SAR) [2], hyperspectral imaging (HSI) [3], image compression [4], and other image processing applications [5], where onboard data processing can significantly reduce data transmission requirements, are amenable to an FPGA’s highly parallel architecture.

In order to leverage FPGAs in space systems, the FPGA must operate correctly and reliably in high-radiation environments. When combined with special system design techniques, SRAM-based FPGAs can be viable for space systems. An SRAM-based FPGA’s primary computational limitation is the possibility of single-event upsets (SEUs) causing errors within the FPGA user logic and routing resources, which can manifest as configuration memory upsets or logic memory (e.g., flip-flops, user RAM) upsets, resulting in deviations from the expected application behavior.
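The difference between a configuration memory upset and a logic memory upset can be illustrated with a toy model. The Python sketch below is an illustration under stated assumptions, not the fault-injection setup used in this work: it treats a 2-input lookup table as a 4-bit truth table held in configuration memory and a register as a small word of user state. The names `lut_eval`, `flip_bit`, `truth_table`, and `AND_CONFIG` are hypothetical.

```python
# Toy SEU model (assumption for illustration only): a 2-input LUT is a 4-bit
# truth table stored in configuration memory; a register holds user state.

def lut_eval(config_bits, a, b):
    """Return the LUT output for inputs a and b, given a 4-bit truth table."""
    index = (a << 1) | b
    return (config_bits >> index) & 1


def flip_bit(word, position):
    """Model a single-event upset by inverting one bit of a stored word."""
    return word ^ (1 << position)


def truth_table(config_bits):
    """List the LUT outputs for inputs (0,0), (0,1), (1,0), (1,1)."""
    return [lut_eval(config_bits, a, b) for a in (0, 1) for b in (0, 1)]


AND_CONFIG = 0b1000  # output is 1 only when a = 1 and b = 1

# Configuration memory upset: the logic function itself changes.
golden = truth_table(AND_CONFIG)
upset_config = flip_bit(AND_CONFIG, 1)
faulty = truth_table(upset_config)

# Logic memory upset: one stored value is corrupted, the function is intact.
register = 0b0101
upset_register = flip_bit(register, 2)
rewritten_register = 0b0110  # the next ordinary write replaces the bad value

print("golden LUT outputs:", golden)
print("faulty LUT outputs:", faulty)
print("register before/after upset:", bin(register), bin(upset_register))
```

In this model the upset truth table gives a wrong output for inputs (0, 1) on every evaluation until the configuration word is rewritten, whereas the corrupted register value is replaced by the next ordinary write.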
Fault-tolerant techniques, such as triple-modular redundancy (TMR) and memory scrubbing, can protect the system from most SEUs and significantly decrease SEU-induced errors. However, because TMR triplicates each protected module and adds voting logic, designing an FPGA-based space system using TMR introduces at least 200% area overhead for each protected module. Depending on the expected upset rates for a given space system, other fault-tolerance methods could be used to provide sufficient reliability while maximizing the resources available for performance.

Algorithm-based fault tolerance (ABFT) is a method that can be used with many linear-algebra operations, such as matrix multiplication or LU decomposition [6]. Fortunately, many space applications are composed of linear-algebra operations; e.g., HSI features matrix multiplication, while SAR features fast Fourier transforms. Other algo-