AN EFFICIENT ERROR-MASKING TECHNIQUE FOR IMPROVING THE SOFT-ERROR ROBUSTNESS OF STATIC CMOS CIRCUITS

Srivathsan Krishnamohan and Nihar R. Mahapatra
E-mail: {krishn37, nrm}@egr.msu.edu
Department of Electrical & Computer Engineering, Michigan State University, East Lansing, MI 48824, USA

Abstract: Soft errors are functional failures resulting from the latching of single-event transients (transient voltage fluctuations at a logic node or SETs) caused by electrical noise or high-energy particle strikes. Due to technology scaling and reduced supply voltages, they are expected to increase by several orders of magnitude in logic circuits in the near future. Existing circuit and architectural solutions are inadequate because they have appreciable area/cost, performance, and/or power overheads. We present a very efficient and systematic error-masking technique for static CMOS combinational circuits that prevents an SET pulse, with width, in the worst case, less than approximately half of the timing slack available in its propagation path, from latching and turning into a soft error. The SET is masked without additional delay and within the clock cycle time in an area-efficient manner, which makes this technique applicable to commodity as well as reliability-critical applications. Application of this technique to ISCAS85 benchmark circuits yields average soft-error rate reduction of 75.71% with average area overhead of only 18.14%.

I. INTRODUCTION

Soft errors are functional failures resulting from the latching of single-event transients (transient voltage fluctuations at a logic node or SETs) caused by electrical noise or external radiation. In this paper, we are concerned in particular with soft errors due to high-energy neutron strikes (especially, in our soft-error rate (SER) analysis), which are a significant source of such errors.
These errors pose increased reliability problems to nanometer-scale circuits due to reduced source/drain capacitances and supply voltages [6]. In designs realized with current bulk CMOS technologies, memory structures such as SRAM contribute most to the SER of the chip. Recent studies have shown that the SER per chip of logic circuits will increase by nine orders of magnitude when the minimum feature size scales from 600 nm to 50 nm, becoming comparable to the SER per chip of unprotected memory elements [17]. This necessitates an efficient design approach for static CMOS combinational circuits that would make them soft-error resilient without adversely affecting other design considerations such as power, performance, and cost.

This research was supported by US NSF grant # 0102830.

Traditional techniques to provide soft-error tolerance rely on triple modular redundancy (TMR), in which the original circuit is triplicated and a majority voter is used to determine the final output. However, this incurs high area and power overheads (> 200%) and a performance penalty, which limits its usage to reliability-critical applications. Various ideas for soft-error tolerance based on time redundancy were presented in [15]. The time-domain majority voter presented in [15] has a performance overhead since sampling starts only after the longest path in the circuit settles. Hence, an online error detection and retry procedure was considered better [1]. Online or concurrent error detection can be achieved by using self-checking circuits [11], [12] or by exploiting temporal redundancy of signals [1]. Self-checking circuits for arbitrary logic functions may require high hardware overhead. Online error detection and retry may affect performance (throughput) and cannot be used in real-time systems to overcome transient faults due to electrical noise or external radiation.
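As a rough illustration of the TMR scheme described above, the following Python sketch models three copies of a combinational function feeding a bitwise 2-of-3 majority voter. It is a behavioral toy under stated assumptions, not a circuit from this paper: the names `majority`, `tmr_output`, and `example_circuit`, the 4-bit example function, and the modeling of an SET as a bit flip on one copy's output are all hypothetical choices made for the illustration.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three replica outputs."""
    return (a & b) | (b & c) | (a & c)


def example_circuit(x: int) -> int:
    """Hypothetical 4-bit combinational function (an assumption for illustration)."""
    return (x ^ (x >> 1)) & 0xF


def tmr_output(circuit, x: int, fault_mask: int = 0, faulty_copy: int = 0) -> int:
    """Evaluate three copies of `circuit` and vote on the result.

    An SET is modeled (as an assumption) by XOR-ing `fault_mask` into the
    output of the copy selected by `faulty_copy`.
    """
    outputs = [circuit(x) for _ in range(3)]
    outputs[faulty_copy] ^= fault_mask
    return majority(*outputs)


golden = example_circuit(0b1011)
# A single corrupted copy is outvoted by the two fault-free copies.
voted = tmr_output(example_circuit, 0b1011, fault_mask=0b0100, faulty_copy=1)
print(voted == golden)
```

The sketch also makes the cost visible: every evaluation runs the circuit three times and adds a voter, which corresponds to the more-than-200% area and power overhead noted above.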
Another technique, called partial error masking, corrects errors with lower overheads than traditional TMR techniques by utilizing the difference in soft-error vulnerabilities of gates [13]. However, it has higher overhead than the technique presented in this work. Prior efforts have also focused on latch design for mitigating soft errors [4], [7] and combinational logic design for preventing transient pulse spreading [2]. Our technique uses a common delay line across an entire module (or modules), as opposed to using delay elements within each latch as proposed in [7]. The latch design presented in [4] requires resistor insertion to slow down the latch input stage, which incurs both performance and area penalties. Time-redundancy-based architectural approaches also have significant performance and power overheads and design time cost [14]. Thus, there is a need for techniques that reduce soft errors efficiently.

In this paper, we present an efficient error-masking design technique for static CMOS combinational circuits that exploits the inherent temporal redundancy (timing slack) of logic signals to increase soft-error robustness. It has a number of features that make it attractive compared to existing approaches: (1) It triplicates or modifies only the primary output (PO) gates of a CLB and thus has lower area and power overheads. (2) These overheads are lowered further by the use of a common delay line for an entire CLB, or even multiple CLBs, to produce the control signals used in the technique. (3) In CLBs that have sufficient slack at a significant fraction of the PO gates, which is quite common, SER can be reduced markedly without any performance overhead. Otherwise, SER can be reduced with some performance overhead. (4) Within the framework of this technique, it is possible to trade off SER reduction against area, performance, and power overheads.

The remainder of the paper is organized as follows. Sec.
II explains our error-masking technique in detail, along with the circuits used to implement it. Sec. III describes the simulation setup and presents results obtained with ISCAS85 circuits, and finally, Sec. IV concludes.

II. TIME REDUNDANCY BASED ERROR MASKING

A. Exploiting Timing Slack

We first analyze the soft-error vulnerability of a CLB in the original circuit, and then, in the next paragraph, explain our technique conceptually and analyze how it exploits timing slack to reduce SER. All time instants in the following discussion are specified in terms of elapsed time after a cycle begins. Let T denote the cycle time. When an SET pulse is generated at the output of a static CMOS gate in a combinational circuit due to a high-energy particle strike, it may propagate to a PO gate u’s output and be captured by an output flip-flop (FF), and thus cause a soft error. At