978-1-5090-1613-6/17/$31.00 ©2017 IEEE

Hybrid, Adaptive, and Reconfigurable Fault Tolerance

Christopher Wilson, NSF CHREC Center, ECE Dept., University of Florida, Room 317, Benton Hall, Gainesville, FL 32611-6200, wilson@chrec.org
Sebastian Sabogal, NSF CHREC Center, ECE Dept., University of Pittsburgh, Room 1238D, Benedum Hall, Pittsburgh, PA 15261, ssabogal@chrec.org
Alan George, NSF CHREC Center, ECE Dept., University of Florida & University of Pittsburgh, Room 1238D, Benedum Hall, Pittsburgh, PA 15261, george@chrec.org
Ann Gordon-Ross, NSF CHREC Center, ECE Dept., University of Florida, Room 319, Larsen Hall, Gainesville, FL 32611-6200, ann@chrec.org

Abstract—The main design challenge in developing space computers featuring hybrid system-on-chip (SoC) devices is determining the optimal combination of size, weight, power, cost, performance, and reliability for the target mission, while addressing the complexity associated with combining fixed and reconfigurable logic. This paper focuses upon fault-tolerant computing with adaptive hardware redundancy in fixed and reconfigurable logic, with the goal of providing and evaluating tradeoffs in system reliability, performance, and resource utilization. Our research targets the hybrid Xilinx Zynq SoC as the primary computational device on a flight computer. Typically, flight software on a Zynq runs on the ARM cores, which by default operate in symmetric multiprocessing (SMP) mode. However, radiation tests have shown this mode can leave the system prone to upsets. To address this limitation, we present a new framework (HARFT: hybrid, adaptive, reconfigurable fault tolerance) that enables switching between three operating modes: (1) ARM cores running together in SMP mode; (2) ARM cores running independently in asymmetric multiprocessing (AMP) mode; and (3) an FPGA-enhanced mode for fault tolerance. While SMP is the default mode, AMP mode may be used for fault-tolerant and real-time extensions.
Additionally, the FPGA-enhanced mode uses partially reconfigurable regions to vary the level of redundancy and to include application- and environment-specific techniques for fault mitigation and application acceleration.

TABLE OF CONTENTS

1. INTRODUCTION AND BACKGROUND
2. BACKGROUND
3. APPROACH
4. EXPERIMENTS AND RESULTS
5. CONCLUSION
ACKNOWLEDGEMENTS
REFERENCES
BIOGRAPHY

1. INTRODUCTION AND BACKGROUND

Due to continuing innovations in sensor technology and research into autonomous operations, space processing has been unable to satisfy the computing demands of new mission requirements. A major challenge for both commercial and government space organizations is the development of new, higher-performance, space-qualified processors for upcoming missions. Space missions impose unique requirements, with dramatic restrictions in size, weight, power, and cost (SWaP-C) and reliability demands in the presence of unique hazards (radiation, temperature, vibration, vacuum) that often have no terrestrial counterparts, so technology developers must consider these requirements closely [1].

Space is a hazardous environment that necessitates special considerations for computing designs to work as intended. A plethora of particles from varying radiation sources can affect electronic components [2]. Radiation effects can be broadly organized into two categories: short-term transient effects and long-term cumulative effects. Transient effects can be further classified into "soft" (recoverable/non-destructive) and "hard" (non-recoverable/destructive) errors.
Soft errors broadly include all types of single-event effects (SEEs), such as single-event upsets (SEUs), single-event functional interrupts (SEFIs), and single-event transients (SETs). Hard errors typically include single-event latch-up (SEL), single-event burnout (SEB), and single-event gate rupture (SEGR). These effects are extensively covered in [3], [4], and [5]. To better prepare spaceflight projects and payloads for exposure to these hazards, NASA developed a multi-step approach to design development that addresses radiation concerns. This approach, entitled Radiation Hardness Assurance (RHA), was published in 1998 by LaBel et al. [6] and later revised [7].

General-purpose processors and FPGAs can manifest SEE-induced errors differently. The main radiation concern for SRAM-based FPGAs is corruption of the device-routing configuration memory and application-oriented block RAMs. Configuration memory allows the FPGA to maintain its pre-programmed, architecture-specific design; therefore, an upset to configuration memory can dramatically change the desired function of the device. These memory structures, along with flip-flops, are particularly vulnerable to radiation. To counter these radiation-induced errors, designers employ configuration-memory scrubbing. Scrubbing is the process of quickly repairing configuration-bit upsets in the FPGA before they render the device inoperable [8]. Additionally, designers use error-correction codes (ECC) and parity schemes for block RAMs and some FPGA configuration memory. Finally, a common approach is to triplicate design structures in the
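The triplication approach mentioned above, commonly realized as triple-modular redundancy (TMR), relies on a majority voter: each output bit takes the value held by at least two of the three replicas, masking a fault in any single replica. A minimal bitwise-voter sketch (the function name and word width are illustrative, not from any particular TMR library):

```c
#include <stdint.h>

/* Bitwise majority voter for triple-modular redundancy (TMR).
 * Each output bit is 1 iff at least two of the three replica bits are 1,
 * so an upset in any single replica is masked. */
static uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}
```

In FPGA designs this voting is typically instantiated in logic (often per flip-flop or per module output) rather than in software, but the majority function is identical.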
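The scrubbing process described above can be illustrated as a readback-and-compare loop against a golden reference. The sketch below simulates configuration memory as a plain array; the frame size and interface are hypothetical, since real devices perform readback and rewrite through vendor-specific ports (e.g., ICAP on Xilinx parts):

```c
#include <stdint.h>
#include <stddef.h>

#define FRAME_WORDS 4  /* hypothetical number of words per configuration frame */

/* Repair any words in the live frame that differ from the golden reference.
 * Returns the number of corrupted words repaired. */
size_t scrub_frame(uint32_t *live, const uint32_t *golden)
{
    size_t repaired = 0;
    for (size_t i = 0; i < FRAME_WORDS; i++) {
        if (live[i] != golden[i]) {   /* readback-and-compare */
            live[i] = golden[i];      /* rewrite the upset word */
            repaired++;
        }
    }
    return repaired;
}
```

Running such a loop continuously bounds the time an upset can persist in configuration memory, which is what keeps the accumulated-upset probability low.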
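As a concrete instance of the parity schemes mentioned above, a single parity bit stored alongside each memory word lets the hardware detect (though not correct) any single-bit upset, since one flipped bit always toggles the word's parity; correction requires a true ECC such as SECDED Hamming. A minimal parity computation over a 32-bit word:

```c
#include <stdint.h>

/* Even-parity check: returns 1 if the 32-bit word has an odd number of
 * set bits. The XOR-fold reduces the word until bit 0 holds the parity. */
static int parity32(uint32_t w)
{
    w ^= w >> 16;
    w ^= w >> 8;
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return (int)(w & 1u);
}
```

Because any single-bit flip changes the result of parity32, comparing the recomputed parity against the stored bit flags the upset.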