An Intellectual Property Core to Detect Task Scheduling-Related Faults in RTOS-Based Embedded Systems

Dhiego Silva, Letícia Bolzani, Fabian Vargas
Electrical Engineering Dept., Catholic University – PUCRS
Av. Ipiranga 6681, 90619-900, Porto Alegre, Brazil
leticia@poehls.com, vargas@computer.org

Abstract: The use of Real-Time Operating Systems (RTOSs) has become an attractive solution to simplify the design of safety-critical real-time embedded systems. Due to stringent constraints such as battery-powered, high-speed and low-voltage operation, these systems are often subject to transient faults originating from a wide spectrum of noise sources, among them conducted and radiated Electromagnetic Interference (EMI). The major consequence is that the system's reliability degrades. In this paper, we present a hardware-based intellectual property (IP) core, the RTOS-Guardian (RTOS-G), that monitors the RTOS's execution in order to detect faults that corrupt the tasks' execution flow in embedded systems based on a preemptive RTOS. Experiments were carried out on the Plasma microprocessor IP core running different test programs that exercise several RTOS resources. During test execution, the proposed system was exposed to conducted EMI according to the international standard IEC 61.000-4-29 for voltage dips, short interruptions and voltage transients on the power supply lines of electronic systems. The obtained results demonstrate that the proposed approach provides higher fault coverage and reduced fault latency when compared to the native fault detection mechanisms embedded in the kernel of the RTOS.

Keywords: Hardware-Based Approach, Intellectual Property (IP) Core, Real-Time Operating System, Reliable Embedded System, Electromagnetic Interference (EMI).

I. INTRODUCTION

Nowadays, several safety-critical embedded systems support real-time applications, which have to respect stringent timing constraints.
In general terms, real-time systems have to provide not only logically correct results, but temporally correct results as well [1]. The high complexity of real-time systems has increased the need to adopt Real-Time Operating Systems (RTOSs) in order to simplify their design. Typically, these systems exploit important facilities associated with the RTOS's native mechanisms to manage tasks, concurrency, memory and interrupts. In other words, the RTOS serves as an interface between software and hardware. At the same time, the ever-increasing hostility of the environment, caused largely by the ubiquitous adoption of wireless technologies, represents a huge challenge for the reliability of real-time embedded systems [2,3]. Note that if these systems are battery-powered, their reliability is even more fragile. In detail, external conditions such as Electromagnetic Interference (EMI), Heavy-Ion Radiation (HIR) and Power Supply Disturbances (PSD) may cause transient faults in electronic systems [4][5][6][7]. Currently, the consequences of transient faults represent a well-known concern in microelectronic systems. The International Technology Roadmap for Semiconductors (ITRS) predicts increasing system failure rates due to this type of fault for future generations of integrated circuits [10]. Therefore, embedded systems based on an RTOS are subject to Single Event Upsets (SEUs) causing transient faults, which can affect both the application running on the embedded system and the RTOS executing the application [8][9]. When the RTOS is affected, this kind of fault can generate scheduling dysfunctions that could lead to incorrect system behavior [1]. Up to now, several solutions have been proposed in order to deal with the reliability problems of real-time systems [11][12][13][14].
However, it is important to note that such solutions provide fault tolerance only at the application level and do not consider faults affecting the RTOS that propagate to the application tasks [1]. Typically, these techniques focus on detecting errors (at the application level) that corrupt data manipulated by the processor and/or induce illegal control-flow execution in the application. Regarding faults affecting the RTOS that propagate to application tasks, about 21% of them lead to application failure [1] and are therefore liable to be detected by this type of solution. Generally, these faults cause tasks to miss their deadlines and to produce incorrect output results. Moreover, the work presented in [8] demonstrates that about 34% of the faults injected into the processor's registers led to scheduling dysfunctions. Of these dysfunctions, about 44% led to system crashes, about 34% caused real-time problems and the remaining 22% generated incorrect system output results. To conclude, the fault tolerance techniques proposed up to now represent feasible solutions, but they do not guarantee that each task respects its deadline. In this paper we present a hardware-based approach to monitor the RTOS's execution flow in order to detect scheduling misbehavior. In more detail, the proposed approach

978-1-4577-1056-8/11/$26.00 © 2011 IEEE. 2011 IEEE 17th International On-Line Testing Symposium