Reliability Effects of Process and Thread Redundancy on Chip Multiprocessors

Dakai Zhu
Department of Computer Science
University of Texas at San Antonio
San Antonio, TX, 78249
dzhu@cs.utsa.edu

Hakan Aydin
Department of Computer Science
George Mason University
Fairfax, VA 22030
aydin@cs.gmu.edu

I. INTRODUCTION AND BACKGROUND

The phenomenal performance gains in successive computer technologies have been obtained at the cost of drastic increases in power densities. This fact promoted energy to a first-class system resource, and energy-aware computing has recently become a major research area. At the same time, with the continued scaling of CMOS technologies and reduced design margins, VLSI circuits have become more susceptible to transient faults that are induced by energetic particles (e.g., neutrons and alpha particles), and today, reliability concerns are pronounced more strongly for all computing systems [7]. The widely popular energy management technique, dynamic voltage scaling (DVS), has been shown to have direct and negative effects on system reliability due to increased transient fault rates [2], [11]. Therefore, there is an interesting tradeoff between system reliability and energy efficiency. For reliable systems that have limited energy budgets, exploring energy-efficient fault tolerance techniques becomes a necessity.

Transient faults (also called soft errors), caused by high-energy particles in computing systems, have been studied extensively, especially for memory sub-systems, because it is relatively easy to detect and model such errors in memory circuits. Error detection/correction schemes (such as parity and SEC-DED codes) have been proposed to enhance data integrity. However, such techniques require significant space redundancy, and complex error-correcting circuits increase memory access times, which limits their use, especially for L1 caches [2].
Moreover, a recent model predicts that, with technology advancements and reduced feature sizes, the transient fault rate in combinational logic circuits will be comparable to that of memory elements [9].

Simultaneous multithreading (SMT) [10] and chip multiprocessor (CMP) [5] architectures were originally proposed to increase system performance. Recently, they have also been explored for fault tolerance purposes, mainly to enhance system reliability through their inherent hardware redundancies [3], [8]. The main idea is to create and run a duplicated thread, on the same or a different core, simultaneously to detect and/or recover from transient faults. With the support of the operating system, one can also migrate/group threads to execute them on a certain core for managing power density in a system [6]. Very recently, an architecture-level power-efficient fault tolerance scheme was proposed to exploit redundant cores in multicore systems, where the redundant executions for verification are performed at lower frequencies for energy savings [7]. By utilizing idle cores to duplicate part of the computation in array-intensive applications, [1] demonstrates the tradeoff among performance, power and reliability.

II. THREAD VS. PROCESS DUPLICATION

However, all previous work has focused on thread-level duplication (TLD), where the duplicated threads share common data/code segments. Although error-checking circuits in memory cells could detect (and possibly correct) a single transient fault, multiple transient faults could create corrupt data blocks in cache or main memory [2]. These, in turn, can affect both duplicated threads and result in a system failure. In this work, we illustrate the trade-off between system reliability and energy consumption with different redundancy granularities (specifically, TLD and process-level duplication (PLD)).
Moreover, by taking into account the effects of DVS on transient fault rates [11], the challenging problem of exploring CMP architectures for both reliability and energy efficiency is identified and possible research directions are discussed.

As an example, consider an application that is duplicated with two threads running on a dual-core CMP system. In Figure 1, the dotted rectangle represents the CMP and each core has two thread running contexts (TRCs). Suppose that two duplicated threads, represented by the shaded and darkened TRCs, respectively, are scheduled to run on the same core (Figure 1a). We can put the second core and unused memory blocks (shown as blank) into a low-power sleep state for energy savings [2]. However, in this case, any undetected transient faults in the memory sub-system (i.e., the shaded L1 cache which is in use, the L2 cache, or main memory blocks that are referenced) could lead to an error for both threads. Consequently, we may have a false positive (for the case where two threads form a duplex system to detect faults) or a failure (for the case where a sanity check is performed at the end of the application for fault detection).

[Figure 1 panels: a. Thread level duplication; b. Process level duplication. Block labels: L2, L1, TRC, Core, M.]
Fig. 1. Computation duplication using threads vs. processes

Assume that the probability of having transient fault(s) in data blocks in use is given by ρ_m. Similarly, ρ_c denotes the probability of having transient fault(s) affecting the execution of one thread on the chip. For the case where sanity checks are