StageNetSlice: A Reconfigurable Microarchitecture Building Block for Resilient CMP Systems

Shantanu Gupta, Shuguang Feng, Amin Ansari, Jason Blome, Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan
Ann Arbor, MI 48109
{shangupt, shoe, ansary, jblome, mahlke}@umich.edu

ABSTRACT
Although CMOS feature size scaling has been the source of dramatic performance gains, it has led to mounting reliability concerns due to increasing power densities and on-chip temperatures. Given that most wearout mechanisms that plague semiconductor devices are highly dependent on these parameters, significantly higher failure rates are projected for future technology generations. Traditional techniques for dealing with device failures have relied on coarse-grained redundancy to maintain service in the face of failed components. In this work, we challenge this practice by identifying its inability to scale to high failure rate scenarios and investigate the advantages of finer-grained configurations. We use this study to motivate the design of StageNet, an embedded CMP architecture designed from its inception with reliability as a first-class design constraint. StageNet relies on a reconfigurable network of replicated processor pipeline stages to maximize the useful lifetime of the chip, gracefully degrading performance toward end of life. This paper addresses the microarchitecture of the basic building block of StageNet, named StageNetSlice, which is a processor core composed of networked pipeline stages. A naive slice design results in approximately a 4X slowdown versus a traditional processor due to longer communication delays in the pipeline. However, several small design changes that eliminate inter-stage communication paths and minimize communication bandwidth reduce this overhead to 11% on average while providing high levels of fine-grained adaptability.
Categories and Subject Descriptors
B.8.1 [Hardware]: Reliability, Testing and Fault-Tolerance; C.1.0 [Computer System Organization]: Processor Architecture

General Terms
Design, Reliability, Performance

Keywords
Multicore, Reliability, Architecture, Pipeline

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CASES’08, October 19–24, 2008, Atlanta, Georgia, USA.
Copyright 2008 ACM 978-1-60558-469-0/08/10 ...$5.00.

1. INTRODUCTION
Device scaling trends into the nanometer regime have led to increasing current and power densities and rising on-chip temperatures, resulting in increasing device failure rates. Leading technology experts have begun to warn designers that device reliability will begin to deteriorate from the 65nm node onward [7]. Current projections indicate that future microprocessors will be composed of billions of transistors, many of which will be unusable at manufacture time, and many more of which will degrade in performance (or even fail) over the expected lifetime of the processor [10]. To assuage these reliability concerns, computer designers must directly address reliability in computer systems through innovative fault-tolerance techniques.

The sources of computer system failures are widespread, ranging from transient faults, due to energetic particle strikes [40] and electrical noise [37], to permanent errors, caused by wearout phenomena such as electromigration [13] and time-dependent dielectric breakdown [39]. In recent years, industry designers and researchers have invested significant effort in building architectures resistant to transient faults and soft errors.
Though there is significant evidence suggesting a growing rate of soft errors in future technology generations [10], this problem is actively being addressed in research [27, 28, 38].

In contrast, much less attention has been paid to the problem of permanent faults, specifically transistor wearout due to the degradation of semiconductor materials over time. Concerns about wearout are primarily due to increasing power and current densities, both of which lead to increasing on-chip temperatures. All three of these parameters have been shown to heavily influence most wearout mechanisms [4]. In fact, most wearout mechanisms exhibit an exponential dependence on temperature [13, 17, 32]. Furthermore, device scaling increases the susceptibility to wearout by shrinking the thickness of the gate and inter-layer dielectrics and increasing interconnect current density. Traditional techniques for dealing with transistor wearout have involved extra provisioning in logic circuits, known as guard-banding, to account for the expected performance degradation of transistors over time. However, the increasing degradation rate projected for future technology generations implies that traditional margining techniques will be insufficient. This necessitates revolutionary new designs for systems that can identify and adapt to wearout through reconfiguration.

The challenge of tolerating permanent faults can be broadly divided into three requisite tasks: fault detection, fault diagnosis, and system recovery/reconfiguration. Fault detection mechanisms [8, 22, 5] are used to identify the presence of a fault, while fault diagnosis techniques [21, 16, 12] are used to determine the source and nature of the fault. System recovery can consist of a number