Analyzing the Optimal Voltage/Frequency Pair in Fault-Tolerant Caches

Vicente Lorente 1, Alejandro Valero 1, Salvador Petit 1, Pierfrancesco Foglia 2, and Julio Sahuquillo 1

1 Department of Computer Engineering, Universitat Politècnica de València, Valencia, Spain
vlorente@disca.upv.es, alvabre@gap.upv.es, {spetit, jsahuqui}@disca.upv.es

2 Dipartimento di Ingegneria dell'Informazione, Università di Pisa, Pisa, Italy
foglia@iet.unipi.it

Abstract—When the processor works at very low voltages to save energy, failures in SRAM cells increase exponentially once the voltage drops below VCCmin. In this context, current SRAM error detection and correction proposals incur a significant performance penalty, since they increase the access latency and disable cache lines that cannot be corrected, thereby decreasing the effective cache capacity. This reduction causes more cache misses and lengthens the execution time which, contrary to expectations, can result in higher energy consumption. This paper characterizes SRAM failures at very low voltages and presents an evaluation methodology to analyze the impact of error correction approaches on energy consumption. To do so, several voltage/frequency pairs are studied and the optimal pair is identified from an energy point of view. To focus the research, experimental results have been obtained for the recently proposed fault-tolerant HER cache. Results show that, for a 32nm technology node, the voltage/frequency pair of 0.45V/800MHz, which induces an SRAM failure rate of about 31%, provides the lowest overall energy consumption (about 62% energy savings compared to a non-faulty conventional cache).

I. INTRODUCTION

Current microprocessors support multiple power modes to exploit the trade-off between performance and power. To speed up execution, high-performance modes run the processor at a high frequency, which requires a high voltage level. In low-power modes, low voltage/frequency levels are used to save energy.
Microprocessor caches are typically implemented with fast Static Random-Access Memory (SRAM) cells. Parameter variations due to imperfections in the fabrication process increase as transistor features continue shrinking in future technologies. This makes SRAM memory cells more unreliable at low voltages, because process variation induces Static Noise Margin (SNM) variability in such cells, which causes failures [1] (known as hard errors) in some of them when working below a certain reliable voltage level, namely VCCmin.

To increase reliability in SRAM cache arrays, industry has used several techniques [2], such as row/column redundancy or Error Detection/Correction Codes (EDC/ECC). However, multi-bit error correction codes have a high overhead [3] because they need additional storage for the correction codes as well as complex and slow decoders to identify errors. Other SRAM fault-tolerant solutions allow the system to work below VCCmin by disabling those segments of the cache where one or more bits fail, thus reducing the effective storage capacity [3]–[8]. Moreover, the highest fault coverage achieved by these techniques is below 10%, which makes them unsuitable for fault-dominated future technology nodes.

On the other hand, embedded Dynamic RAM (eDRAM) cells [9] have emerged in recent processors [10], [11] to build low-level caches, since they allow high density and low power consumption. An interesting feature of these cells is that hard errors mainly affect the cell retention time instead of altering the stored value, so variation problems can be addressed in eDRAM by increasing the refresh rate. To address both performance and hard errors, SRAM and eDRAM cells have recently been combined to implement a hybrid Hard Error Recovery (HER) L1 data cache architecture [12], which tolerates up to 100% faulty SRAM cells in low-power modes.
Nevertheless, the HER cache also suffers performance penalties at very low voltages, since its effective SRAM storage capacity is severely reduced (up to 90% failure rate). In addition, the retention time of the eDRAM cells is also affected, which means that eDRAM cell contents are lost faster, causing a noticeable rise in the miss ratio. Finally, the average access time also increases due to the higher latency of eDRAM technology.

In summary, existing SRAM fault-tolerant proposals incur a significant performance penalty, since they increase the access latency and reduce the effective cache capacity when working in low-power modes. At very low voltages, the execution time can grow dramatically due to these effects, so extra energy is required to complete the program execution. Moreover, low voltages are necessarily paired with low processor frequencies, which lengthens the cycle time to the point that the execution time can grow critically. Unfortunately, this can imply not only a performance loss but also higher energy consumption with respect to higher voltage/frequency pairs. Therefore, even though the processor is working in a low-power mode and the voltage is reduced to save energy, the total energy consumption can exceed that consumed with a higher