Exploring the Synergy of Emerging Workloads and Silicon Reliability Trends

Marc de Kruijf    Karthikeyan Sankaralingam
Department of Computer Science, University of Wisconsin-Madison
Vertical Research Group (vertical@cs.wisc.edu)

Abstract—Technology constraints and application characteristics are radically changing as we scale to the end of silicon technology. Devices are becoming increasingly brittle, highly varying in their properties, and error-prone, leading to a fundamentally unpredictable hardware substrate. Applications are also changing, and emerging new classes of applications increasingly rely on probabilistic methods. They have an inherent tolerance for uncertainty and can tolerate hardware errors. This paper explores the synergy between application error tolerance and hardware uncertainty. Our key insight is to expose device-level errors up the system stack instead of masking them. Using a compiler instrumentation-based fault-injection methodology, we study the behavior of a set of PARSEC benchmarks under different error rates. Our methodology allows us to run programs to completion, and we quantitatively measure quality degradation in the programs' output using application-specific quality metrics. Our results show that many applications have a high tolerance to errors. Injecting errors into individual static instructions at rates of 1% and higher, we find that between 70% and 95% of those instructions cause only minimal degradation in the quality of the program's output. Based on a detailed analysis of these programs, we propose lightweight application-agnostic mechanisms in hardware to mitigate the impact of errors.

I. INTRODUCTION

Advances in semiconductor manufacturing technology have enabled consistent, progressive reductions in the size of on-chip devices, resulting in exponential growth in the number and speed of devices on chip and the overall performance of microprocessors.
Through the next decade and until the end of CMOS technology, this device scaling is expected to continue. However, the underlying properties of these devices are radically changing, and we are entering an era of non-ideal process scaling and unpredictable silicon technology that is causing disruptive changes at many layers of the microelectronics system stack [3], [14]. Currently, computer systems are designed assuming perfection at many levels, from devices through CAD to the microarchitecture and ISA. Driven by scaling, however, device-level permanent and transient errors, aging, and variability are becoming first-order constraints [15]. In the future, it may be too hard to maintain this illusion of perfection: the model of hardware being correct all the time, in all regions of the chip, and forever may become prohibitively expensive to sustain. The technology-driven question that follows from this observation is: how can we build working hardware from unpredictable silicon?

New classes of high-performance applications are emerging that are dominated by probabilistic algorithms used for image recognition, video search, text and data mining, modeling virtual worlds, and games [6], [16]. Many of these applications share an inherent ability to tolerate uncertainty and provide an opportunity to use hardware creatively. While they require large computational capability, they raise the possibility that hardware does not have to be always correct. Hence, the key application-driven questions are: (1) can these applications indeed tolerate hardware uncertainty, and (2) how can we efficiently support the massive computation needs of these emerging applications?

A rich body of literature has focused on building a fault-free machine abstraction and implementation in the presence of device errors [13]. In contrast, we observe that emerging applications are inherently tolerant of errors and are algorithmically designed to operate on noisy data.
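To make this abstraction concrete, the sketch below illustrates the two ingredients of such a methodology in miniature: a toy fault injector that, with probability p, flips a single bit of an instruction's result (a hypothetical stand-in for compiler-based instrumentation, not the actual instrumentation pass), and a hypothetical application-specific quality function that scores a faulty run's output against a fault-free reference. Both functions are illustrative assumptions, not code from the paper's toolchain.

```c
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy model of an exposed transient error (hypothetical; not the paper's
 * actual LLVM instrumentation): with probability p, flip one randomly
 * chosen bit of a computed value instead of masking the fault. */
static uint32_t inject_fault(uint32_t value, double p)
{
    if ((double)rand() / ((double)RAND_MAX + 1.0) < p) {
        int bit = rand() % 32;        /* uniformly chosen bit position */
        value ^= (uint32_t)1 << bit;  /* single-bit upset */
    }
    return value;
}

/* An "instrumented" add whose result may carry an exposed error. */
static uint32_t faulty_add(uint32_t a, uint32_t b, double p)
{
    return inject_fault(a + b, p);
}

/* Hypothetical application-specific quality metric: mean clamped relative
 * error between a fault-free reference output and a faulty run's output,
 * mapped to [0, 1], where 1.0 means the outputs are identical. */
static double output_quality(const double *ref, const double *faulty, size_t n)
{
    double err = 0.0;
    for (size_t i = 0; i < n; i++) {
        double denom = fabs(ref[i]) > 1e-12 ? fabs(ref[i]) : 1.0;
        double rel = fabs(ref[i] - faulty[i]) / denom;
        err += rel > 1.0 ? 1.0 : rel;  /* clamp per-element error at 1 */
    }
    return 1.0 - err / (double)n;
}
```

In this sketch, setting p = 0.01 corresponds to a 1% per-instruction error rate; downstream code simply receives the possibly corrupted value, and the quality function quantifies how gracefully the application's output degrades.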
We depart significantly from prior work and treat hardware errors that corrupt computation as effectively rendering the input data noisy. Our work is motivated by the insight that exposing device errors up the system stack can become more efficient than masking these errors when devices become highly unpredictable. The cross-over point at which masking becomes less efficient than exposing is an open question and is highly dependent on technology constraints. Figure 1a illustrates the trade-off. This work explores an architecture space where hardware is allowed to execute with errors and these errors are simply exposed to applications and the system, allowing higher-level software to decide how to manage them. While error rates are manageable today, non-ideal process scaling is likely to increase them, and hence now is the time to explore such a design space. In this paper, we motivate the need for such an abstraction by outlining technology trends and application trends, focusing on the synergy between them.

We present a comprehensive analysis and characterization of emerging workloads and attempt to quantify their tolerance to errors. We analyze a set of PARSEC benchmarks (x264, bodytrack, canneal, streamcluster, swaptions) [2] and a futuristic realtime ray-tracer called Razor [5], [11]. We use an LLVM-based [9] toolchain to probabilistically inject errors into applications and run full applications on real hardware. We also develop application-specific quality functions that measure how an application's output degrades due to hard-