www.embedded-world.eu

Fast Fault Injection to Evaluate Multicore Systems Soft Error Reliability

Felipe Rosa¹, Luciano Ost², Ricardo Reis¹, Simon Davidmann³, Larry Lapides³

¹ UFRGS - Instituto de Informatica - PGMicro/PPGC
² Department of Engineering - University of Leicester
³ Imperas Software Ltd.
{frdarosa, reis}@inf.ufrgs.br, luciano.ost@le.ac.uk, {simond, larryl}@imperas.com

Abstract— The increasing complexity of processors, allied with the continuous technology shrink, is making multicore-based systems more susceptible to soft errors. The high cost and time inherent in hardware-based fault injection approaches make more efficient simulation-based fault injection frameworks crucial for testing reliability. This paper proposes a fast, flexible fault injector framework that supports parallel instruction-accurate simulation to speed up the fault injection process. Fault injection campaigns were performed on ARM processors, considering a Linux kernel and benchmarks with up to 220 million object code instructions. Results show fault injection at speeds of up to 1550 MIPS. This enables users to identify errors and exceptions according to different criteria and classifications.

Keywords—component; soft error; fault injection simulation; multicore systems.

I. INTRODUCTION

The increasing computing capacity of multicore components such as processors and graphics processing units (GPUs) offers new opportunities for the embedded and high performance computing (HPC) domains. The progressively growing computing capacity of multicore-based systems enables complex application workloads to be executed efficiently at lower power consumption than traditional single-core solutions. Such efficiency and the ever-increasing complexity of application workloads encourage industry to integrate more and more computing components into the same system.
The number of computing components employed in large-scale HPC systems already exceeds a million cores [1], while 1000-core on-chip platforms are available in the embedded community [2]. Beyond the massive number of cores, the increasing computing capacity and the growing number of internal memory cells (e.g., registers and internal memory) inherent to emerging processor architectures are making large-scale systems more vulnerable to both hard and soft errors [3], [4]. Moreover, to meet emerging performance and power requirements, the underlying processors usually run at aggressive clock frequencies and in multiple voltage domains, increasing their susceptibility to soft errors, such as the ones caused by radiation effects.

The occurrence of soft errors or Single Event Effects (SEEs) may cause critical failures in system behavior, which may lead to financial losses or loss of human life, as already reported in [5], [6]. While a rate of 280 soft errors per day has been observed during the flight of a spacecraft [7], electronic computing systems working at ground level are expected to experience at least one soft error per day in the near future [8].

The growing susceptibility of multicore systems to SEEs calls for novel cost-effective tools to assess, early in the design phase, the soft error resilience of the underlying multicore components running complex software stacks (operating system (OS), drivers, etc.). With this trend in mind, researchers are investigating new fault injection techniques and proposing new tools to evaluate the occurrence of SEEs in commercial state-of-the-art processors. In this context, the use of virtual platform frameworks is attractive due to their simulation performance and design flexibility (i.e., support for a large number of component models, compilers, and debugging facilities).
Due to their high simulation speed (typically hundreds of MIPS), virtual platform simulators based on just-in-time (JIT) dynamic binary translation appear to have an advantage over event-driven simulators. However, this simulation performance comes at the cost of limited microarchitecture exploration support and timing accuracy. The resulting scenario raises a major question: can we rely on soft error analysis produced by JIT-based frameworks?

To address the gap between the available fault injection tools and industry requirements, this paper describes the development of a fault injector module (FIM) that was assembled with OVPsim [9], [10], which relies on JIT dynamic binary translation technology. To answer the above question on the credibility of JIT-based simulation, the developed FIM was also integrated into gem5 [11], an event-driven virtual platform framework that targets microarchitecture exploration.

The main contributions of this work are the following:
Proposal of a fast and flexible fault injector framework, called OVPsim-FIM, which supports the