High Performance Computing on Fault-Prone Nanotechnologies: Novel Microarchitecture Techniques Exploiting Reliability-Delay Trade-offs *

Andrey V. Zykov, Elias Mizan, Margarida F. Jacome, Gustavo de Veciana, Ajay Subramanian
Department of Electrical and Computer Engineering, The University of Texas at Austin.
{zykov,mizan,jacome,gustavo,ajay}@ece.utexas.edu

ABSTRACT
Device and interconnect fabrics at the nanoscale will have a density of defects and a susceptibility to transient faults far exceeding those of current silicon technologies. In this paper we introduce a new performance optimization dimension at the microarchitecture level which can mitigate the overheads introduced by fault tolerance. This is achieved by directly exposing reliability versus delay design trade-offs while incorporating novel forms of speculation which use faster but less reliable versions of a microarchitecture's performance-critical components. Based on a parameterized microarchitecture, we exhibit the benefits of optimizing these trade-offs.

Categories and Subject Descriptors: C.1 [Computer Systems Organization] Processor Architectures, Performance of Systems; B.8 [Hardware] Performance and Reliability
General Terms: Performance, Design, Reliability.
Keywords: Nanotechnologies, Fault Tolerant Microarchitectures, Performance Optimization, Reliability-Delay Trade-offs.

1. INTRODUCTION
Recent striking successes in devising and assembling nanoelectronic devices suggest that the ability to build large-scale nanofabrics for computation is now on the 10–15 year horizon [1, 2, 3]. Nanotechnologies based on carbon nanotubes and silicon nanowires are particularly promising. Nanotube switches can theoretically operate at unprecedented speeds, e.g., 100–200 GHz, while nanowire junction arrays can be configured as OR, AND, and NOR logic gates, with gain, and thus be used to realize basic computation [2].
Although many challenges lie ahead, many predict that it will be possible to assemble workable computer memory and logic devices from nanoscale building blocks before silicon devices hit their limits [1]. As such, it is critical to start investigating the design methods and computing system architectures required to take these technologies into design/production environments [4, 5].

Irrespective of the 'winning' (charge carrier transport-based) nanotechnologies, it is widely recognized that devices and interconnects at the nanoscale will exhibit fault densities much greater than state-of-the-art silicon technology. Indeed, they: (1) will have a density of defects which is much higher than in current silicon technologies [1]; and (2) are likely to be much more susceptible to transient faults (soft errors) [1]. These increases are, in part, due to the physical dimensions being considered. From a materials perspective, decreasing the size of structures increases the ratio of surface area to volume, making imperfections in material interfaces more critical to the proper function of interconnects and devices. Furthermore, at such reduced scales, the discrete nature of atomic matter and charge becomes significant.

* This work is supported in part by SRC Grant CRS 1152.001 and NSF Grant CCR 031019.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DAC 2005, June 13–17, 2005, Anaheim, California, USA.
Copyright 2005 ACM 1-59593-058-2/05/0006 ...$5.00.
Namely, a single charge or defect may significantly impact the structural stability of a nanodevice, as well as its timing/performance characteristics and its sensitivity to fluctuations in the local electrostatic environment (electric noise). These observations point to a reliability problem which is intrinsic to nanoscale regimes.

In this paper we formalize the problem of exploring reliability-delay trade-offs at the microarchitecture level for performance enhancement. This is achieved by a novel form of reliability-driven speculation relying on faster but less reliable versions of a microarchitecture's performance-critical components or stages. Based on a simple parameterized microarchitecture, we exhibit the benefits of explicitly exploring this novel class of trade-offs at the microarchitectural level.

2. RELIABILITY-DELAY TRADE-OFFS AT THE MICROARCHITECTURE LEVEL
For the purposes of this research we use a parametric model that captures how a component's delay might scale with its desired reliability. The reliability of a component is the probability that it performs its function correctly on a given use. Our model is based on fundamental considerations of how reliability increases with redundancy, and of how the increased area associated with redundancy would lead to higher delays – see [6] for details. This strongly suggests that highly fault-tolerant microarchitecture component designs for nanotechnologies will incur substantial delay overheads. Herein lies the fundamental question addressed in this paper: can one effectively 'hide', at the microarchitecture level, the performance overheads incurred by fault-tolerant component designs? We will show that this is indeed the case.

A new set of trade-offs at the microarchitecture level. The key novelty of our work is the introduction of a new performance optimization dimension in microarchitectural design.
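The flavor of such a parametric model can be conveyed with a toy sketch: under majority-voted N-modular redundancy, reliability rises with the number of copies, while the added area plausibly adds delay. The linear delay model and the numbers below are illustrative assumptions for exposition only, not the model of [6].

```python
from math import comb

def nmr_reliability(p: float, n: int) -> float:
    """Probability that majority voting over n copies yields the correct
    result, given per-copy reliability p (an ideal voter is assumed)."""
    majority = n // 2 + 1
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(majority, n + 1))

def nmr_delay(n: int, d0: float = 1.0, alpha: float = 0.3) -> float:
    """Hypothetical delay model: each extra copy adds a fractional
    wiring/voting overhead alpha on top of the unprotected delay d0."""
    return d0 * (1 + alpha * (n - 1))

# Reliability climbs quickly with redundancy, but so does delay.
for n in (1, 3, 5):
    print(n, round(nmr_reliability(0.99, n), 6), nmr_delay(n))
```

The qualitative point survives any reasonable choice of parameters: pushing a component's failure probability down by orders of magnitude costs area, and hence delay.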
This is achieved by exposing reliability-delay trade-offs through novel forms of speculation relying on faster but less reliable versions of a microarchitecture's performance-critical components. Architects will need to perform design space exploration to identify the most favorable reliability-delay trade-off for each 'speculative' component. We shall call this broad class of techniques reliability-driven speculation, and microarchitectures enhanced with such features reliability-aware (RA) microarchitectures.

Selection of baseline microarchitecture. The principle of reliability-driven speculation, and the associated performance optimization, can be applied to essentially any architecture, including EPIC/VLIW, dataflow, etc. In this paper, we will demonstrate its impact on out-of-order (OOO) superscalar processors. Given the sub-
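The payoff that such design space exploration seeks can be sketched in the same toy setting: speculate on a fast, less reliable version of a component and re-execute on a fully protected version whenever the fast result is wrong. The design points, fallback delay, and recovery penalty below are invented for illustration; they are not the paper's parameterized microarchitecture, and perfect mis-speculation detection is assumed.

```python
def expected_delay(d_fast: float, r_fast: float,
                   d_safe: float, penalty: float) -> float:
    """Expected per-use delay when speculating on a component with delay
    d_fast and reliability r_fast, re-executing on a reliable fallback
    (delay d_safe, plus a fixed recovery penalty) on a wrong result."""
    return d_fast + (1 - r_fast) * (penalty + d_safe)

# Hypothetical points on a reliability-delay curve: (delay, reliability).
points = [(1.0, 0.90), (1.3, 0.99), (1.8, 0.999), (2.5, 0.999999)]
d_safe = 2.5  # the fully protected fallback design

best = min(points, key=lambda p: expected_delay(p[0], p[1], d_safe, 1.0))
print(best)  # neither the fastest nor the most reliable point wins
```

With these numbers the optimum lies at an intermediate point: the fastest version mis-speculates too often, while the most reliable one forfeits the speedup. This is exactly the trade-off an architect would explore per 'speculative' component.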