5th Mediterranean Conference on Embedded Computing MECO’2016, Bar, Montenegro

Using Power Consumption in the Performability of Fault-Tolerant FPGAs

Gehad I. Alkady, Elect. and Comm. Eng. Department, American University in Cairo, Cairo, Egypt, g.i.alkady@aucegypt.edu
Nahla A. El-Araby, Electrical and Electronics Eng. Department, Canadian International College, Cairo, Egypt, nahla_alaraby@cic-cairo.com
H.H. Amer, Elect. and Comm. Eng. Department, American University in Cairo, Cairo, Egypt, hamer@aucegypt.edu
M.B. Abdelhalim, College of Computing and Information Technology, AASTMT, Cairo, Egypt

Abstract—Over the last decade, several fault-tolerant techniques for FPGAs have been proposed, especially for recovering from permanent faults. Most of those techniques are based on relocating the defective module to a new location that acts as a spare. This raises the question of how many spares should be added to a system. In this paper, a performability model is developed to quantitatively investigate the appropriate number of spares. The model uses power consumption as the penalty. An example with three Markov models is studied to show system designers how to find the appropriate tradeoff between the increase in reliability and the decrease in performability.

Keywords-FPGAs; Fault-Tolerance; Reliability; Markov Model; Performability; Power.

I. INTRODUCTION

Field programmable gate arrays (FPGAs) play a major role in digital system implementation. They are widely used in many applications, including medical, military, aerospace, and industrial applications. FPGAs have many advantages, including high performance, high reconfigurability (which allows fast prototyping), and on-site hardware upgrades. Due to the recent trend toward smaller gate lengths, FPGAs have become more prone to faulty behavior. This change has had a negative impact on the reliability of the basic architecture of the FPGA [1]. Accordingly, fault-tolerant FPGA-based applications are required.
Systems that can autonomously self-recover are called Autonomous Fault-Tolerant Systems (AFTSs) [2]. FPGAs are affected by two types of faults: transient faults and permanent faults. Transient faults, such as Single Event Upsets (SEUs) and address decoding faults, induce no permanent damage and can be recovered from by downloading the original bit file again. Many techniques have been introduced to mitigate the effects of transient faults, such as Triple Modular Redundancy (TMR) [3] and Error Correcting Codes (ECC) [4]. Permanent faults cause enduring effects on the silicon of the FPGA. They include Time Dependent Dielectric Breakdown (TDDB) and open faults. Several techniques have been proposed to recover from permanent faults. The technique in [5] detects open faults in FPGA programmable interconnections using a low-cost fault detection technique, namely Duplication With Compare (DWC). It also proposes a fault recovery technique based on a 2-to-1 multiplexer and a spare module, i.e., a second copy of the main module. If a fault exists, the output of the spare module is selected. In [6], the entire FPGA is reconfigured with precompiled bit files when a permanent fault occurs, while in [7], the FPGA is partitioned into cells, each of which contains only one module. Once a permanent fault occurs, the modules are shifted one by one to the remaining cells, and this is repeated until all remaining cells in the FPGA are used. In [8], the FPGA is divided into numerous reconfigurable tiles: logic tiles, interconnection tiles, and recovery tiles. When a permanent fault exists, the design is relocated into a recovery tile and the interconnection tile is reconfigured. In [9], a fault recovery technique based on the Partial Reconfiguration (PR) property [10] is proposed. It relies on identifying a Partially Reconfigurable block in the FPGA which acts as a spare for the first faulty module in the design.
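As an illustration of the DWC detection and multiplexer-based recovery attributed to [5] above, the following is a minimal behavioral sketch in Python, not the hardware design of [5]. All names (`dwc_mismatch`, `mux2to1`, `run_with_recovery`) are hypothetical. The sketch assumes that a separate duplicate is used only for comparison, that a mismatch is attributed to the main module, and that the fault indication is latched once raised.

```python
from typing import Callable, Iterable, List, Tuple

# A module is modeled as a pure function from an input word to an output word.
Module = Callable[[int], int]


def dwc_mismatch(main: Module, duplicate: Module, x: int) -> bool:
    """Duplication With Compare: report a fault when the two copies disagree."""
    return main(x) != duplicate(x)


def mux2to1(select_spare: bool, main_out: int, spare_out: int) -> int:
    """2-to-1 multiplexer: forward the spare output once a fault is flagged."""
    return spare_out if select_spare else main_out


def run_with_recovery(
    main: Module, duplicate: Module, spare: Module, inputs: Iterable[int]
) -> Tuple[List[int], bool]:
    """Drive the modules with `inputs`; switch to the spare on the first mismatch.

    Assumption: the fault flag is latched, since a permanent fault does not
    disappear, so the multiplexer keeps selecting the spare afterwards.
    """
    fault_latched = False
    outputs: List[int] = []
    for x in inputs:
        if not fault_latched and dwc_mismatch(main, duplicate, x):
            fault_latched = True
        outputs.append(mux2to1(fault_latched, main(x), spare(x)))
    return outputs, fault_latched
```

In this sketch, a main module that starts producing wrong values partway through the input sequence is masked from the first mismatching input onward, because the multiplexer then forwards the spare output. Comparison alone cannot tell which of the two copies is faulty, which is why attributing the fault to the main module is stated here as an assumption.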
Most of the previously mentioned recovery techniques are based on relocating the faulty module to a new location which acts as a spare, so that the system can remain operational. But what is the adequate number of spares that should be added to the design of FPGA-based circuits? This question can be answered using three metrics: reliability, power consumption, and performability [11]. In this paper, the question of how many spares should be added to an FPGA-based system to increase its reliability and lifetime is studied via a simple case study. The recovery technique in [9] is used in the analysis. Continuous Time