978-1-5090-1613-6/17/$31.00 ©2017 IEEE
Hybrid, Adaptive, and Reconfigurable Fault Tolerance
Christopher Wilson
NSF CHREC Center
ECE Dept., University of Florida
Room 317, Benton Hall
Gainesville, FL 32611-6200
wilson@chrec.org
Sebastian Sabogal
NSF CHREC Center
ECE Dept., University of Pittsburgh
Room 1238D, Benedum Hall
Pittsburgh, PA 15261
ssabogal@chrec.org
Alan George
NSF CHREC Center
ECE Dept., University of Florida &
University of Pittsburgh
Room 1238D, Benedum Hall
Pittsburgh, PA 15261
george@chrec.org
Ann Gordon-Ross
NSF CHREC Center
ECE Dept., University of Florida
Room 319, Larsen Hall
Gainesville, FL 32611-6200
ann@chrec.org
Abstract—The main design challenge in developing space
computers featuring hybrid system-on-chip (SoC) devices is
determining the optimal combination of size, weight, power,
cost, performance, and reliability for the target mission, while
addressing the complexity associated with combining fixed and
reconfigurable logic. This paper focuses upon fault-tolerant
computing with adaptive hardware redundancy in fixed and
reconfigurable logic, with the goal of providing and evaluating
tradeoffs in system reliability, performance, and resource
utilization. Our research targets the hybrid Xilinx Zynq SoC as
the primary computational device on a flight computer.
Typically, flight software on a Zynq runs on the ARM cores
that by default operate in symmetric multiprocessing (SMP)
mode. However, radiation tests have shown this mode can leave
the system prone to upsets. To address this limitation, we
present a new framework (HARFT: hybrid adaptive
reconfigurable fault tolerance) that enables switching between
three operating modes: (1) ARM cores running together in
SMP mode; (2) ARM cores running independently in
asymmetric multiprocessing (AMP) mode; and (3) an FPGA-
enhanced mode for fault tolerance. While SMP is the default
mode, AMP mode may be used for fault-tolerant and real-time
extensions. Additionally, the FPGA-enhanced mode uses
partially reconfigurable regions to vary the level of redundancy
and include application- and environment-specific techniques
for fault mitigation and application acceleration.
TABLE OF CONTENTS
1. INTRODUCTION AND BACKGROUND ...................... 1
2. BACKGROUND ......................................................... 2
3. APPROACH .............................................................. 4
4. EXPERIMENTS AND RESULTS ................................ 6
5. CONCLUSION ........................................................... 9
ACKNOWLEDGEMENTS ............................................ 10
REFERENCES ............................................................. 10
BIOGRAPHY ................................................ 11
1. INTRODUCTION AND BACKGROUND
Due to continuing innovations in sensor technology and research into
autonomous operations, space processing has been unable to
satisfy the computing demands of new mission requirements. A
major challenge for both commercial and government space
organizations is the development of new, higher-performance,
space-qualified processors for these missions. Space missions
impose unique requirements, including dramatic restrictions in
size, weight, power, and cost (SWaP-C) and reliability
demands in the presence of unique hazards (radiation,
temperature, vibration, vacuum). These hazards often have no
terrestrial counterparts, so technology
developers must consider these requirements closely [1].
Space is a hazardous environment that necessitates special
considerations for computing designs to work as intended. A
plethora of particles from varying radiation sources can
affect electronic components [2]. Radiation effects can be
broadly organized into two categories: short-term transient
effects and long-term cumulative effects. Transient effects
can be further classified into “soft” (recoverable/non-
destructive) and “hard” (non-recoverable/destructive) errors.
Soft errors include single-event effects (SEEs) such as
single-event upsets (SEUs), single-event functional interrupts
(SEFIs), and single-event transients (SETs). Hard errors
typically include single-event latch-up (SEL), single-event
burnout (SEB), and single-event gate rupture (SEGR). These
effects are extensively covered in
[3], [4], and [5]. To better prepare spaceflight projects and
payloads for exposure to these hazards, NASA developed a
multi-step approach for design development that addresses
radiation concerns. This approach, entitled Radiation
Hardness Assurance (RHA), was published in 1998 by LaBel et
al. [6] and later revised in [7].
General-purpose processors and FPGAs can manifest
radiation errors from SEEs differently. The main
radiation concern for SRAM-based FPGAs is corruption in
the device-routing configuration memory and application-oriented
block RAMs. Configuration memory allows the FPGA to
maintain its pre-programmed, architecture-specific design;
therefore, an upset to configuration memory can
dramatically change the desired function of the device.
These memory structures along with flip-flops are
particularly vulnerable to radiation. To counter these
radiation-induced errors, designers employ configuration-memory
scrubbing. Scrubbing is the process of quickly repairing
configuration-bit upsets in the FPGA before they render the
device inoperable [8]. Additionally, designers use Error-
Correction Codes (ECC) and parity schemes for block
RAMs and some FPGA configuration memory. Finally, a
common approach is to triplicate design structures in the