A low-cost fault tolerant solution targeting commercial FPGA devices

Kostas Siozios, Dimitrios Soudris

Microprocessors and Digital Systems Lab, School of Electrical and Computer Engineering, Department of Computer Science, National Technical University of Athens (NTUA), 9 Heroon Polytechneiou, Zographou Campus, 15780 Athens, Greece

Article info

Article history:
Received 10 December 2011
Received in revised form 26 June 2013
Accepted 30 September 2013
Available online 19 October 2013

Keywords:
Fault tolerant
FPGA
Methodology
CAD tool
Reliability improvement

Abstract

Technology scaling, in conjunction with the trend towards higher operation frequency, results in increased thermal stress, which in turn leads to upsets due to reliability degradation. In this paper, we introduce a software-supported framework that aims to provide sufficient fault coverage against upsets that occur due to aging phenomena. Experimental results with a number of industry-oriented DSP kernels show the effectiveness of our framework: we achieve average improvements in maximum operation frequency and power consumption of 15% and 70%, respectively, compared to a well-established commercial solution, for comparable fault masking.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

More than at any other time, global economics favor programmable chips over costly Application-Specific Integrated Circuits (ASICs) and Application-Specific Standard Products (ASSPs), since the costs and risks associated with application-specific devices can only be justified for a short list of ultra-high volume commodity products.
Hence, programmable platforms, and more specifically Field-Programmable Gate Arrays (FPGAs), have become the only viable means for today’s companies to meet increasingly stringent product requirements (cost, power, performance, and density) in a business environment characterized by spiraling complexity, shrinking market windows, fickle market demands, capped engineering budgets, escalating ASIC and ASSP non-recurring engineering costs, and increased risk.

Recently, a number of FPGA platforms with increased performance and density were released by the reconfigurable hardware industry (e.g. Virtex-6, Virtex-7, EasyPath series, Stratix-IV, Stratix-V, etc.) [1,2]. Even though these devices exhibit superior system-level functionality, they still consume more power than other hardware implementations (e.g., ASICs, ASSPs) [3].

Meeting the power budget, and consequently the thermal budget, is an essential criterion by which customers measure the success of their FPGA-based designs. The problem is becoming far more severe because power density in FPGA devices almost doubles every three years [4–6], and this trend is expected to intensify with technology scaling according to the “α-power” law [7].

Among other effects, higher power and thermal stress make the target architecture more likely to suffer from reliability degradation, and hence reduce the mean time between failures (MTBF) [8–11]. For instance, for commercial grade FPGAs, the maximum die temperature without performance degradation is reported as 80 °C, whereas the absolute maximum temperature is 125 °C [3]. Furthermore, based on previously published data, an average-sized design mapped onto a Virtex-E FPGA with 90% device utilization could raise the die temperature to 50 °C above the ambient temperature [3].

To address reliability degradation, a number of fault tolerant systems have been proposed.
These solutions allow a design to continue its operation, possibly at a reduced level, rather than failing completely, when some part of the system fails. Existing approaches to designing fault tolerant systems are applicable at either the hardware or the software level. Previous analyses have shown that hardware-based fault tolerance exhibits superior performance compared to alternative (software-based) implementations, but it imposes increased fabrication cost [12–20]. The derived platforms exhibit static fault masking, which is defined at fabrication time. On the other hand, software-based fault tolerant solutions can potentially combine the required dependability level with the low cost of commodity devices [21–25].

Journal of Systems Architecture 59 (2013) 1255–1265. Journal homepage: www.elsevier.com/locate/sysarc
1383-7621/$ - see front matter. http://dx.doi.org/10.1016/j.sysarc.2013.09.011
Corresponding author. Tel.: +30 210 772 3653; fax: +30 210 772 4305.
E-mail addresses: ksiop@microlab.ntua.gr, ksiop@ee.duth.gr (K. Siozios).

Apart from novel methodologies, a number of CAD tools that support these methodologies also have to be developed [21–23,25]. These approaches are mainly academic solutions, while up to now the only known commercially available framework for providing fault masking on FPGAs is the Xilinx Triple