Model-based runtime analysis of distributed reactive systems

Andreas Bauer, Martin Leucker, Christian Schallhart
Institut für Informatik, Technische Universität München
{baueran, leucker, schallha}@informatik.tu-muenchen.de

Abstract

Reactive distributed systems have pervaded everyday life and objects, but often lack measures to ensure adequate behaviour in the presence of unforeseen events or even errors at runtime. As interactions and dependencies within distributed systems increase, so does the problem of detecting failures which depend on the exact situation and environment conditions they occur in. As a result, not only is the detection of failures increasingly difficult, but also the differentiation between the symptoms of a fault and the actual fault itself, i.e., the cause of a problem.

In this paper, we present a novel and efficient approach for analysing reactive distributed systems at runtime, in that we provide a framework for detecting failures as well as identifying their causes. Our approach is based upon monitoring safety properties, specified in the linear-time temporal logic LTL (respectively, TLTL), from which monitor components are generated automatically to detect violations of these properties. Based on the results of the monitors, a dedicated diagnosis is then performed in order to identify explanations for the misbehaviour of a system. These may be used to store detailed log files, or to trigger recovery measures. Our framework is modular and layered, and incurs only minimal communication overhead, especially when compared to other, similar approaches. Further, we sketch first experimental results from our implementations, and describe how the framework can be used to build a variety of distributed systems using our techniques.

1. Introduction

Reactive real-time systems are increasingly embedded and, due to modern communication and fault-tolerant bus technologies, also increasingly laid out as distributed systems.
Often they control safety-critical applications and have already pervaded everyday life, e.g., in the form of automotive control systems used in present-day cars, mobile phones, or modern aircraft systems.

In general terms, a real-time system is one in which the temporal aspects are part of its specification. As such, not only the correctness of a computed result is crucial, but also the time at which it is produced. In the case of an embedded system, it is usually the environment which imposes a strict frequency upon the system, which needs to react and respond, i.e., meet hard deadlines. Such systems are more precisely referred to as reactive systems [11]. However, not only embedded systems can be reactive; many business information systems are also typically labelled as being real-time sensitive, or reactive. Unlike in the embedded world, however, many deadlines in business information systems are soft deadlines, i.e., some of them may be missed by the system without fatal consequences for the environment or even human life.

The design and development of embedded systems, especially in a safety-critical setting such as the automotive domain, can be guided by the use of formal methods [28], such as model checking or deductive reasoning, in order to increase our confidence in the correctness of the system. However, formal methods employed in the design and development process alone cannot guarantee that systems are sufficiently prepared to deal with unforeseen events or even errors, probably induced by the environment. Moreover, certain assumptions made during the development process, e.g., predetermined fault models, may prove to be inadequate in a real-world setting.

1.1. Related work

Although many of today’s systems are equipped with custom built-in diagnostic mechanisms, they usually provide insufficient means to distinguish between the symptoms of a fault, i.e., an observed failure, and the actual fault itself, i.e., its cause.
Diagnostics is then often reduced to a mere recording of symptoms. To address this problem, various improvements have been suggested and implemented, for instance, adding knowledge about the system under scrutiny in the form of cause and symptom “tables”, reflecting the effects of certain failures [17, 14]. These may be obtained beforehand from a dedicated hazard and risk analysis, or directly from the engineers who designed the system and