208 IEEE TRANSACTIONS ON DEVICE AND MATERIALS RELIABILITY, VOL. 10, NO. 2, JUNE 2010

Robust Register Caching: An Energy-Efficient Circuit-Level Technique to Combat Soft Errors in Embedded Processors

Mahdi Fazeli, Student Member, IEEE, Alireza Namazi, and Seyed Ghassem Miremadi, Senior Member, IEEE

Abstract—This paper presents a cost-efficient technique to jointly use circuit- and architecture-level techniques to protect an embedded processor's register file against soft errors. The basic idea behind the proposed technique is robust register caching (RRC), which creates a cache of the most vulnerable registers within the register file in a small and highly robust cache memory built from circuit-level single-event-upset-protected memory cells. To guarantee that the most vulnerable registers are always stored in the robust register cache, the average number of read operations during a register's lifetime is used as a metric to guide the cache replacement policy. A register is vulnerable to soft errors when it holds a value that will be used in subsequent cycles. Consequently, while a register value is stored in the register cache, it is robust against single- and multiple-bit upsets. To minimize the power overhead of the RRC, the clock-gating technique is exploited in the main register file, resulting in significantly reduced power consumption. The RRC was experimentally evaluated using the LEON processor with two benchmark suites, namely, the MiBench embedded benchmark suite and the SPEC CPU2006 general-purpose benchmark suite. Our experimental results show that if the cache size is selected appropriately, the architectural vulnerability factor (AVF) of the register file is significantly reduced while incurring low power, area, and performance overheads.

Index Terms—Embedded processors, fault tolerance, multiple-bit upsets (MBUs), register file, single-event upset (SEU).

I. INTRODUCTION

High-energy particle strikes have severely challenged the reliability of today's embedded processors. Until recently, single-bit upsets (SBUs) in memory elements were thought to be the main effect of particle strikes. However, as technology shrinks to the nanometer scale, multiple-bit upsets (MBUs) are causing serious reliability problems [1]–[5]. An MBU is defined as several adjacent bit upsets simultaneously caused by a single particle strike. In [1], it was reported that up to five bit upsets may occur in a 130-nm SRAM as a result of neutron particle collisions. In [6], it was estimated that for a commercial 130-nm SRAM, 7% of all adverse events are MBUs, of which 92% are double bit upsets and 8% are triple bit upsets. These results indicate that MBUs should be taken into account when designing reliable systems.

Manuscript received August 22, 2008; revised November 28, 2009; accepted January 8, 2010. Date of publication February 5, 2010; date of current version June 4, 2010. An earlier version of this paper appeared in the Proceedings of the 39th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2009) [38].
The authors are with the Department of Computer Engineering, Sharif University of Technology, Tehran 11365-9363, Iran (e-mail: m_fazeli@ce.sharif.edu; namazi@ce.sharif.edu; miremadi@sharif.edu).
Digital Object Identifier 10.1109/TDMR.2010.2041234

The register file is a critical part of an embedded processor from a reliability perspective [7], [8]. This is because the register file holds useful data that are frequently accessed by the processor. This implies that an error in a register will most likely propagate to other parts of the processor. Several error-handling techniques have been proposed, which fall under two general categories. The first category contains techniques that are based on detection and recovery mechanisms, such as the use of parity bits and rollback recovery.
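As a minimal illustration of why detection based on a single parity bit interacts poorly with MBUs, the following sketch flips one bit and then two adjacent bits of a register value and checks whether even parity notices the change. The function names, the 32-bit example value, and the flipped bit positions are assumptions made for illustration only; they are not taken from the paper or from any cited design.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit (number of 1 bits modulo 2) of a word."""
    return bin(word).count("1") % 2


def parity_detects(stored: int, flip_mask: int) -> bool:
    """Return True if one parity bit notices the bits flipped by flip_mask."""
    corrupted = stored ^ flip_mask
    return parity_bit(corrupted) != parity_bit(stored)


value = 0x1234ABCD          # arbitrary 32-bit register value (assumed)
single_upset = 0b1 << 7     # one flipped bit, i.e., an SBU
double_upset = 0b11 << 7    # two adjacent flipped bits, i.e., an MBU

sbu_detected = parity_detects(value, single_upset)
mbu_detected = parity_detects(value, double_upset)
print(sbu_detected, mbu_detected)  # prints: True False
```

Any odd number of flipped bits changes the parity and is detected, whereas any even number leaves it unchanged, so a double bit upset passes unnoticed under this simple scheme.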
Although these techniques are efficient in terms of power usage, they may violate the real-time requirements of embedded applications if recovery completes after certain task deadlines. The other category contains techniques that are based on error-correction codes (ECCs) or error-masking techniques such as TMR [9]–[11]. Protecting the entire register file using ECCs or error-masking techniques is not appropriate for embedded applications due to the high power consumption overhead. In addition, ECCs that can detect and correct MBUs require a larger number of check bits (or ECC bits) and therefore impose larger area and power overheads [12]. Moreover, reading and calculating ECC bits during each read operation can impose performance and power consumption overheads. The scenario gets even worse for register files with multiple read ports. To address this issue, an SEU-tolerant technique that protects only the most vulnerable registers of the register file with an SECDED code is proposed in [7] and [8]. This technique is based on the fact that not all of the registers are equally vulnerable to soft errors. Techniques that utilize the architectural vulnerability characteristics of the register file [7], [8], [13] are regarded as architecture-level techniques in this paper. However, even though the aforementioned technique offers lower power consumption overhead than that associated with protecting the entire register file with an SECDED code, it cannot efficiently cope with MBUs.

One effective way to combat MBUs in the register file is to use circuit-level-protected SEU-tolerant memory cells in the register file [11], [14]–[16]. If a register file is built from such memory cells, each memory cell is individually protected against particle strikes, so the register file becomes highly robust against upsets in multiple cells, i.e., MBUs.
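To give a rough sense of the check-bit cost mentioned above for ECC-protected registers, the sketch below counts the check bits of a Hamming-based SECDED code using the standard bound 2^r >= m + r + 1 for m data bits, plus one overall parity bit for double-error detection. This is a textbook calculation and not a figure from this paper; the function names and the chosen register widths are assumptions for illustration.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """SEC check bits plus one overall parity bit for double-error detection."""
    return hamming_check_bits(data_bits) + 1


for width in (8, 16, 32, 64):       # assumed register widths
    checks = secded_check_bits(width)
    overhead = 100.0 * checks / width
    print(f"{width}-bit register: {checks} check bits ({overhead:.1f}% extra storage)")
```

For a 32-bit register this yields 7 check bits, i.e., the familiar (39, 32) SECDED layout. Codes that also correct multiple-bit upsets need more check bits than this baseline, which is the overhead the text refers to.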
However, this approach imposes rather high power and area overheads if it is used to protect the entire register file.

Although different architecture-level SEU-tolerant techniques based on ECCs or detection and recovery mechanisms have been proposed to protect register files, there are few studies that cost-efficiently integrate circuit- and architecture-level SEU-tolerant techniques such that they complement each