A Soft Error Emulation System for Logic Circuits

Sandip Kundu ∗
Department of ECE, University of Massachusetts, Amherst, MA 01003, USA
kundu@ecs.umass.edu

Matthew D.T. Lewis, Ilia Polian, Bernd Becker
Albert-Ludwigs-University, Georges-Köhler-Allee 51, 79110 Freiburg i. Br., Germany
{lewis|polian|becker}@informatik.uni-freiburg.de

Abstract

In nanometer technologies, soft errors in logic circuits are increasingly important. Since the failure in time (FIT) rates for these circuits are very low, millions of test vectors are required for a realistic analysis of soft errors. This exceeds the capabilities of software simulation tools. We propose an FPGA emulation architecture that can apply millions of vectors within seconds. Comprehensive soft error profiling was done for ISCAS 89 circuits. Soft errors were assigned to four different classes, and their latency and recovery time were obtained. This information is useful for understanding the vulnerability of the system to soft errors and hardening it against such errors.

1 Introduction

A transient fault causes a circuit node to glitch but does not cause any permanent damage. Such faults typically occur due to ionizing radiation from α-particles or cosmic rays [1]. Since a transient fault occurs due to an external event outside the control of logic operation, these faults are not repeatable (i.e., the fault may occur during a test run and not occur during a second identical test run). The probability that a single node upset will cause a circuit failure is small in many instances.

In the literature, the terms “transient fault” and “soft error” have been used interchangeably. This causes some confusion. We define a transient fault as a single or multiple node(s) upset directly attributable to excess charge carriers induced by external radiation (the terms Single Event Upset (SEU) and Single Event Transient (SET) are used in the Nuclear Science literature [2]).
By contrast, we define a soft error as the impact of a transient fault that may persist beyond a cycle. For the impact to transfer from one clock phase to another, it must be captured in a storage element such as a latch or memory. Thus, a soft error may manifest away from the location of the strike. Therefore, transient faults and soft errors can be viewed as cause and effect.

As mentioned earlier, a transient fault is not repeatable. During manufacturing test, a soft error observed by a tester is indistinguishable from any other error source such as a permanent defect. These parts will be discarded. However, passing manufacturing test does not mean that transient faults will not occur during actual operation. On the contrary, process variation may make one circuit part more vulnerable to soft errors than another. To distinguish between parts based on soft error susceptibility, different kinds of tests are needed. Such tests will apply patterns from a tester repeatedly and measure the mean time to failure (MTTF) to estimate the failure in time (FIT) rate.

∗ This work was performed while the author was a guest professor at the Albert-Ludwigs-University of Freiburg.

The FIT rate is defined as the number of failures per billion hours of operation, i.e., about 114,000 years. Thus, a FIT rate of 1 would imply that the MTTF is 114,000 years. The typical FIT rate of today’s commercial semiconductor circuits is between 1 and 100. Consequently, the MTTF is between 1,140 and 114,000 years.

It is quite obvious that measuring an MTTF of thousands of years is impractical. Therefore, acceleration techniques are needed to measure MTTF [3]. In an accelerated system, the device under test (DUT) is exposed to much higher levels of radiation than found in nature. The MTTF is measured under accelerated conditions, and then this result is scaled for lower levels of radiation. It is difficult to validate the scaling factors (calibration) [4]. Thus, FIT rates are often gross estimates.
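The FIT-to-MTTF figures quoted above can be checked with a short calculation. The following Python sketch is illustrative only and not part of the original work: the function names are hypothetical, a year is assumed to have 8,760 hours (leap years ignored), and the MTTF is taken as one billion hours divided by the FIT rate, as the text implies.

```python
# Illustrative sketch: convert between FIT rate and MTTF in years.
# Assumptions (not from the paper): 1 year = 8,760 hours, and
# MTTF = 1e9 hours / FIT, as implied by the definition in the text.

HOURS_PER_YEAR = 24 * 365   # 8,760 hours; leap years ignored
FIT_REFERENCE_HOURS = 1e9   # FIT counts failures per billion device-hours


def mttf_years(fit_rate: float) -> float:
    """Return the MTTF in years for a given FIT rate (hypothetical helper)."""
    if fit_rate <= 0:
        raise ValueError("FIT rate must be positive")
    mttf_hours = FIT_REFERENCE_HOURS / fit_rate
    return mttf_hours / HOURS_PER_YEAR


def fit_from_mttf_years(years: float) -> float:
    """Return the FIT rate for a given MTTF in years (hypothetical helper)."""
    if years <= 0:
        raise ValueError("MTTF must be positive")
    return FIT_REFERENCE_HOURS / (years * HOURS_PER_YEAR)


if __name__ == "__main__":
    for fit in (1, 100):
        print(f"FIT = {fit:>3}: MTTF ~ {mttf_years(fit):,.0f} years")
```

Under these assumptions, a FIT rate of 1 gives roughly 114,155 years and a FIT rate of 100 gives roughly 1,142 years, consistent with the rounded values of 114,000 and 1,140 years used in the text.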
These estimates can be improved if such transient faults can be simulated. Since simulation is usually many orders of magnitude slower than actual hardware operation, it seems impractical as well. In this paper we study the applicability of emulation techniques to accelerate the simulation of transient faults. Note that the FIT rate is associated with a chip and not a circuit node. Typically, the rate of transient faults is a few orders of magnitude higher than the FIT rate. The relation between these two quantities is a topic of investigation in this paper.

As mentioned earlier, the main source of soft errors is radiation. Consequently, soft errors in devices with elevated radiation exposure were historically considered first, including medical [5] and aerospace applications [6, 7]. Soft errors in integrated circuits used in other fields are mainly due to cosmic radiation and the alpha particles emitted by the package (advances in packaging technology have almost eliminated the latter problem). Historically, memories rather than random logic were studied