A Low Overhead Error Confinement Method based on Application Statistical Characteristics

Zheng Wang, Georgios Karakonstantis, Anupam Chattopadhyay

School of Electrical and Electronic Engineering, Nanyang Technological University, wangz@ntu.edu.sg
School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, g.karakonstantis@qub.ac.uk
School of Computer Engineering, Nanyang Technological University, anupam@ntu.edu.sg

Abstract—Reliability has emerged as a critical design constraint, especially in memories. Designers go to great lengths to guarantee fault-free operation of the underlying silicon by adopting redundancy-based techniques, which essentially try to detect and correct every single error. However, such techniques come at the cost of large area, power and performance overheads, leading many researchers to question their efficiency, especially for error-resilient systems where 100% accuracy is not always required. In this paper, we present an alternative method focusing on the confinement of the output error induced by reliability issues. Focusing on memory faults, rather than correcting every single error, the proposed method exploits the statistical characteristics of the target application and replaces any erroneous data with the best available estimate of that data. To realize the proposed method, a RISC processor is augmented with custom instructions and special-purpose functional units. We apply the method on the proposed enhanced processor by studying the statistical characteristics of the various algorithms involved in a popular multimedia application. Our experimental results show that, in contrast to state-of-the-art fault tolerance approaches, we are able to reduce runtime and area overhead by 71.3% and 83.3%, respectively.

I. INTRODUCTION

The aggressive shrinking of transistors has made circuits, and especially memory cells, more prone to parametric variations and soft errors, which are expected to double with every technology generation [1], thus threatening their correct functionality. The increasing demand for larger on-chip memory capacity, predicted to exceed 70% of the die area in multiprocessors by 2017, is expected to further worsen failure rates [2], indicating the need for immediate adoption of effective fault-tolerant techniques. Techniques such as Error Correcting Codes (ECC) [3] and checkpointing [4] have helped in correcting memory failures; however, they incur large area, performance and power overheads, wasting resources and conflicting with the high memory density requirements. In an effort to limit such overheads, recent approaches exploit the tolerance of many applications to faults/approximations [5] and relax the requirement of 100% correctness. The main idea of such methods is the restricted use of robust but power-hungry bit-cells and of methods such as ECC to protect only the bits that play a more significant role in shaping the output quality [6][7]. A few very recent approaches also extend generic instruction sets with approximation features and specialized hardware units [8][9][10]. Although such techniques are very interesting and have showcased the available possibilities in certain applications, they are still based on redundancy and have neglected to exploit some more fundamental characteristics of the application data.

Contribution: In this paper, we enhance the state of the art by proposing an alternative system-level method for mitigating memory failures and presenting the necessary software and hardware features for realizing it within a RISC processor.
The proposed approach, instead of adding circuit-level redundancy to correct memory errors, tries to limit the impact of those errors on the output quality by replacing any erroneous data with the best available estimate of those data. The proposed approach is realized by enhancing a common programming model and a RISC processor with custom instructions and low-cost hardware support modules. We demonstrate the low overhead and error-mitigation ability of the proposed approach by applying it to the different algorithmic stages of JPEG and comparing with the extensively used Single Error Correction Double Error Detection (SECDED) method. Overall, the proposed scheme offers better error confinement since it is based on application-specific statistical characteristics, while allowing single and multiple bit errors to be mitigated with substantially lower overheads.

The rest of this work is organized as follows. Section II introduces the proposed approach, while Section III describes the enhancements of a processor for realizing it. Section IV presents the statistical analysis of the proposed approach. Section V presents the simulation results. Finally, Section VI concludes the work.

II. PROPOSED ERROR CONFINEMENT METHOD

Assume that a set of data d ∈ D = {d_1, ..., d_K} produced by an application is distributed according to the probability mass function P_d(d_k) = Pr(d = d_k). Such data are stored in a memory that is affected by parametric variations causing errors (i.e. bit flips) in some of the bit-cells. Such errors eventually result in erroneous data d̄, leading to a new data distribution P̄_d(d_k). The impact of such faults can be quantified by using a relevant error cost metric, which in many cases is the mean square error (MSE), defined as

C(d̄) ≜ E[(d − d̄)²]    (1)

with the expectation taken over the memory input d.
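As a concrete illustration of the cost metric in (1), the following sketch measures the MSE induced by single-bit flips in stored words. The 8-bit word width, the skewed synthetic data distribution and the uniform-bit-flip fault model are illustrative assumptions, not the paper's actual fault model:

```python
import random

random.seed(0)

# Hypothetical 8-bit application data, skewed toward small values
# (as is common for, e.g., transform coefficients in image codecs).
data = [min(int(random.expovariate(0.1)), 255) for _ in range(10000)]

def flip_random_bit(word, width=8):
    """Model a soft error: flip one uniformly chosen bit of a stored word."""
    return word ^ (1 << random.randrange(width))

# Corrupt every word once and evaluate the cost C(d_bar) of (1), i.e.
# the mean square error between the original and the faulty data.
corrupted = [flip_random_bit(d) for d in data]
mse = sum((d - e) ** 2 for d, e in zip(data, corrupted)) / len(data)
print(f"MSE under single-bit flips: {mse:.1f}")
```

Note that flips in the high-order bits dominate the cost, which is why the error magnitude, rather than the error count, is the quantity worth confining.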
Our proposed method focuses on minimizing the MSE between the original stored data d and the erroneous data d̄, given a-priori information F about the error, through an error-mitigation function d* = g(F), which can be obtained by solving the following optimization problem:

d* = g(F) ≜ arg min_{d̄} C(d̄ | F),    (2)

where

C(d̄ | F) ≜ E[(d − d̄)² | F].    (3)

Basic arithmetic manipulations show that the resulting correction function is given by g_MMSE = E{d[n] | F}. This essentially corresponds to the expected value of the original fault-free data. Such expected values can be determined offline through Monte-Carlo simulations, or analytically in case the reference data distribution is already known, as in many DSP applications. Note that the above function depends on the applied cost metric that is relevant for the target application, and other functions may exist that can be found by following the above procedure. In this paper, we focus on the MSE, which is relevant for many applications and especially for the case study that we discuss later.

III. REALIZING THE PROPOSED ERROR CONFINEMENT IN A RISC PROCESSOR

The proposed error-confinement function requires a scheme for detecting a memory error, which provides the needed a-priori information F, and a look-up table for storing the expected reference values to be used for replacing the erroneous data. Obviously, the realization of such a scheme in a processor requires i) the introduction of custom instructions and ii) micro-architectural enhancements, which we discuss next.

978-3-9815370-6-2/DATE16/©2016 EDAA
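The offline-profiling and replacement scheme described above can be sketched in software as follows. Here the a-priori information F is reduced to a single flag ("this word is faulty"), so the look-up table holds just the unconditional expected value E{d}; the Gaussian 8-bit data and the single-bit-flip fault model are illustrative assumptions, not the paper's hardware mechanism:

```python
import random

random.seed(1)

# Offline phase: profile the application's data distribution and store
# its expected value E{d} as the replacement estimate (a real deployment
# would profile each algorithmic stage separately).
data = [max(0, min(int(random.gauss(40, 15)), 255)) for _ in range(10000)]
expected = round(sum(data) / len(data))  # look-up table entry

def flip_random_bit(word, width=8):
    return word ^ (1 << random.randrange(width))

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Online phase: every word suffers a detected single-bit error, and each
# flagged word is replaced by the precomputed expected value.
faulty = [flip_random_bit(d) for d in data]
confined = [expected] * len(data)

print("MSE without confinement:", mse(data, faulty))
print("MSE with confinement   :", mse(data, confined))
```

With confinement, the residual MSE approaches the variance of the data itself, whereas leaving the corrupted words in place pays the full penalty of high-order bit flips; this is the sense in which the expected-value replacement confines, rather than corrects, the error.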