IIE Transactions (2011) 43, 647–660
Copyright © “IIE”
ISSN: 0740-817X print / 1545-8830 online
DOI: 10.1080/0740817X.2010.546385

Event log modeling and analysis for system failure prediction

YUAN YUAN¹, SHIYU ZHOU¹,∗, CRISPIAN SIEVENPIPER², KAMAL MANNAR² and YIBIN ZHENG²

¹ Department of Industrial and Systems Engineering, University of Wisconsin–Madison, Madison, WI 53706, USA
E-mail: szhou@engr.wisc.edu
² GE Healthcare, Pewaukee, WI 53072, USA

Received August 2009 and accepted November 2010

Event logs, commonly available in modern mechatronic systems, contain rich information on the operating status and working conditions of the system. This article proposes a method to build a statistical model using event logs for system failure prediction. To achieve the best prediction performance, prescreening and statistical variable selection are adopted to select the best set of predictor events, which are coded as covariates in the statistical model. The prediction power of the model is discussed in depth in terms of false alarm and misdetection probabilities. A real-world example further confirms the effectiveness of the proposed method.

Keywords: Event logs, Cox proportional hazard model, variable selection, prediction power

1. Introduction

Rapid developments in information technology have made it possible to automatically collect data on the events that occur in a mechatronic system while the system is in use. For example, manufacturers have implemented data acquisition and transmission systems that collect system event logs from installed medical diagnostic imaging systems. The events recorded in the event logs are related to various machine activities, critical system failures, operator/user actions, task status, etc. Figure 1 illustrates a simplified event sequence from a Computed Tomography (CT) machine that contains 18 events of four different types.
In this figure, K represents a system failure, such as the “scan abort” in CT machines, and A, B, and C represent other system event types. For example, event A could indicate that the temperature at a location in the machine is above a certain level and event B could indicate a communication error within the machine. The occurrences of events are marked along a timeline. In practice, a large number of events are often recorded. For instance, the event log of a typical CT machine during a 1-month period could contain more than 1 000 000 events of 200 different types.

It is generally believed that event logs can act as a rich information source about the system’s working conditions and that they can be used for condition monitoring, diagnosis, and maintenance decision making. For example, a faulty detector in a CT machine will eventually lead to a scan abort failure. However, before its total failure, a faulty detector can cause a series of other events such as an analog-to-digital converter error, communication error, or software error. By observing these preceding events (subsequently called predictor events), we can predict that the key failure event is about to occur. With accurate failure prediction, preventive maintenance can be conducted to reduce unexpected machine downtimes and maintenance costs. Thus, it is highly desirable to develop a modeling and analysis methodology for event logs to enable the accurate (in a statistical sense) prediction of the occurrence of failure events.

∗ Corresponding author

In order to achieve this objective, a critical step is to establish a rigorous mathematical model to describe the relationships between the failure events and other events in the event log. Formally, an event sequence $S$ is a triplet $(T_s, T_e, s)$ defined on a set of events $E$, where $T_s$ and $T_e$ are the start and end times of the sequence, respectively, and $s = \langle (E_1, t_1), (E_2, t_2), \ldots, (E_m, t_m) \rangle$ is an ordered sequence of events such that $E_i \in E$ for all $i = 1, 2, \ldots, m$ and $t_i$ is the occurrence time of $E_i$ with $T_s \le t_1 \le \cdots \le t_m \le T_e$. The problem of predicting failure occurrences can be formulated as follows: given the event sequence $S$ containing, among others, occurrences of the failure event K, how do we construct a mathematical model to predict the occurrence of the failure event K?

Techniques to predict the failure event(s) based on the analysis of event sequence data have been proposed in the literature. These methods can be roughly classified as either design-based methods or data-driven rule-based methods. In the design-based methods, the expected event sequence is obtained from a system design and it is compared with the observed event sequence. The system failure is identified