Adaptive Software Fault Tolerance Policies with Dynamic Real-Time Guarantees

Paolo Bizzarri*, Andrea Bondavalli*, Edgar Nett**, Hermann Streich**, Fabio Tarini*

* CNUCE-CNR, I-56126 Pisa, Italy
** Research Division of Responsive Systems, German National Center for Computer Science (GMD), D-53754 St. Augustin, Schloß Birlinghoven

-draft version-

Abstract

Real-time applications with high dependability requirements demand fault tolerance strategies. While policies based on worst-case execution times (WCETs) can be used for small systems with static behaviour, this is not true for more complex systems, in which WCETs are partially unknown or differ drastically from average execution times. In such cases, often only a minimum level of quality can be achieved. This paper proposes to combine fault tolerance policies described in the FERT (Fault-tolerant Entity for Real Time) notation with the dynamic scheduling scheme TPS (TaskPair-Scheduling). TPS alleviates FERT's precondition of completely known WCETs and provides a flexible implementation base that allows FERT strategies to be mapped easily to a run-time system. In a first step, a significant subset of FERT is investigated, which includes the Recovery Block Scheme, N-Version Programming, and Imprecise Computations. TPS is utilised to guarantee different levels of quality, tailored to the application and the required level of fault tolerance, while guaranteeing that a common deadline is met.

1 Introduction

For Real-Time applications requiring high dependability, Fault Avoidance techniques, although necessary, cannot exclude the occurrence of faults at run-time. Fault Tolerance (FT, in the following) policies must therefore be employed. The necessary redundancy can either be sized statically for the worst case or be exploited through adaptable behaviour of the application. In case of overload or lack of resources due to failures, adaptable use of redundancy can allow graceful degradation of the system; e.g.
safety-critical tasks could be supported as long as possible, while non-critical ones could receive fewer resources or even be omitted.

To achieve an adaptive use of redundancy, the application designer and the run-time support must co-operate: the designer specifies the needs and flexibility of the application, while the run-time support takes them into account in managing system resources. This specification could also make it possible to integrate Fault Tolerance policies into the design and implementation of the application, and into the validation of its Real-Time requirements. Special notations have been proposed for this purpose, both as programming-in-the-large notations [Nie91] and as notations aimed at an intermediate level [BSS95] between programming-in-the-large and programming-in-the-small. These notations have to be general, powerful and expressive, in order to give the designer the maximum freedom. With the FERT notation, fault tolerance strategies can be defined for complex real-time systems and tailored to the existing environment.

One of the problems of providing fault tolerance in real-time applications is that fault tolerance strategies are often very time- (resource-) intensive and that worst-case execution times are often difficult to estimate. Imprecise estimates also increase resource consumption, which may ultimately lead to infeasible schedules. On the other hand, the actual resource consumption may often be much smaller than the estimated one, which means that the probability of a successful computation before a given deadline is high. The problem for the run-time system is to guarantee that a given deadline is not violated, even if a computation does not succeed within its timing bounds. The TaskPair-Scheduling (TPS) scheme [Str95] matches this requirement. It provides a