Using Input-to-Output Masking for System-Level Vulnerability Estimation in High-Performance Processors

Alireza Haghdoost¹, Hossein Asadi¹, Amirali Baniasadi²
¹Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
²Department of Electrical and Computer Engineering, University of Victoria, Canada
haghdoost@ce.sharif.edu, asadi@sharif.edu, amirali@ece.uvic.ca

Abstract—In this paper, we enhance previously suggested vulnerability estimation techniques by presenting a detailed modeling technique based on Input-to-Output Masking (IOM). Moreover, we use our model to compute the System-level Vulnerability Factor (SVF) for data-path components in a high-performance processor. As we show, recently suggested estimation techniques overlook error masking and focus mainly on the time periods in which an error could potentially propagate in the system. We show that this view is incomplete because it ignores the impact of masking. Our results show that including the IOM factor can significantly affect the system-level vulnerability of data-path components. As a case study, we analyze the IOM factor for CPUs with different configurations. Our results show that the average variation of the IOM factor across these configurations is less than 5%. Meanwhile, the IOM factor varies between 24% and 76% across the applications studied here. Accordingly, we find the IOM factor to be mainly workload dependent and less configuration dependent.

Index Terms—System-Level Vulnerability, Architectural Vulnerability Factor, High-Performance Processors, Fault Masking Factor.

I. INTRODUCTION

The data integrity of high-end and mainstream processors is threatened by cosmic and terrestrial energetic particles, such as neutrons and alpha particles from packaging materials. These energetic particles can change the state of storage elements, such as flip-flops and SRAM cells, within processors and cause a transient error.
Radiation-induced transient errors, also called soft errors, occur more often than hard errors in current VLSI technology [1], [2]. A recent study has shown that soft errors can have a significant impact on the data integrity of current microprocessor technology [3]. As technology continues to scale down and the number of transistors per chip continues to grow, the soft error rate per chip is expected to increase for the next several years [1]. Accordingly, designers will need to incorporate aggressive protection techniques in future microprocessor designs.

An important aspect of designing cost-effective protection techniques is developing accurate soft error vulnerability models for individual components. Such models help designers understand the extent of vulnerability of data-path components, such as caches, register files, and load/store queues, before developing protection techniques. An accurate model for these components facilitates informed decisions about the level of protection needed across data-path components and target workloads. The right protection level for data-path structures reduces the probability of data loss and therefore increases system reliability.

A recent field study of several thousand systems indicates that, in current processor technology, a majority of system reboots are initiated by single event upsets (SEUs) occurring in data-path components such as caches and register files [3]. Errors in such structures can easily propagate to the system outputs and can significantly reduce system reliability. Cache reliability is particularly important because errors occurring in the data cache can propagate to higher memory levels and can easily lead to data integrity issues [4], [5]. While designing caches with low access time and low miss rate is an important goal, maintaining low power dissipation and high reliability has also become necessary.
This is particularly true for high-end and mainstream processors, where reliability has always been a vital concern.

Previous studies have introduced analytical models to compute the vulnerability of data-path components, such as caches and register files, to SEUs [6], [7], [8], [9]. Such models often provide fast estimates but suffer from inaccuracy because they do not take the system-level impact of soft errors into account. More accurate measurement approaches, i.e., fault injection (FI) strategies [10], [11], [12], [13], are time-consuming, due to the large number of runs, and still prone to inaccuracy, due to the limited number of addresses targeted.

The goal of this study is to introduce a new vulnerability estimation technique that improves on the accuracy of previous estimation methods while keeping estimation time low. We do so by taking into account an important parameter ignored by earlier studies. Previous studies estimate vulnerability mainly by measuring the time period in which an error occurring in a data block could potentially propagate in the system, also referred to as the critical time. While critical time is an important factor, it is not the only one.

In this work, we present a modeling technique based on the Input-to-Output Masking (IOM) factor. We define the IOM factor of a component as the percentage of errors masked when erroneous values propagate from the inputs to the outputs of the component. We present a technique to compute the IOM factor of the components of a high-performance processor. Using the IOM factor, we also present a modeling technique to estimate the Component-level Vulnerability Factor (CVF) and the System-level Vulnerability Factor (SVF) of the data-path components of a high-performance processor. We define

978-1-4244-6268-8/10/$26.00 ©2010 IEEE