Practical Online Failure Prediction for Blue Gene/P: Period-based vs Event-driven

Li Yu, Ziming Zheng, Zhiling Lan
Department of Computer Science, Illinois Institute of Technology
{lyu17, zzheng11, lan}@iit.edu

Susan Coghlan
Leadership Computing Facility, Argonne National Laboratory
smc@alcf.anl.gov

Abstract—To facilitate proactive fault management in large-scale systems such as IBM Blue Gene/P, online failure prediction is of paramount importance. While many techniques have been presented for online failure prediction, questions arise regarding two commonly used approaches: period-based and event-driven. Which one has better accuracy? What is the best observation window (i.e., the time interval used to collect evidence before making a prediction)? How does the lead time (i.e., the time interval from the prediction to the failure occurrence) impact prediction accuracy? To answer these questions, we analyze and compare period-based and event-driven prediction approaches via a Bayesian prediction model. We evaluate these prediction approaches, under a variety of testing parameters, by means of RAS logs collected from a production supercomputer at Argonne National Laboratory. Experimental results show that the period-based Bayesian model and the event-driven Bayesian model can achieve up to 65.0% and 83.8% prediction accuracy, respectively. Furthermore, our sensitivity study indicates that the event-driven approach seems more suitable for proactive fault management in large-scale systems like Blue Gene/P.

I. INTRODUCTION

A. Motivation

Proactive fault management has been studied to meet the increasing demands of reliability and availability in large-scale systems. The process of proactive fault management usually consists of four steps: online failure prediction, further diagnosis, action scheduling, and execution of actions [15]. It is widely acknowledged that online failure prediction is crucial for proactive fault management.
The accuracy of failure prediction can greatly impact the effectiveness of fault management. On one hand, a fault-tolerant action taken in response to a failure warning is useless if the prediction is a false alarm. Consequently, too many false alarms may introduce a high management overhead because of the large number of unnecessary fault management actions. On the other hand, if the predictor misses too many failures, the effectiveness of fault management is questionable. Li et al. have shown that run-time fault management can be effective only when the prediction achieves an acceptable accuracy.

Generally speaking, online failure prediction methods can be classified into two groups: the period-based approach and the event-driven approach, which differ in their trigger mechanism [15].

1) Period-based approach: Typically, a prediction cycle of a period-based method consists of three parts, as shown in Figure 1: an observation window W_obs, a lead time W_lt, and a prediction window W_pdt. W_obs is usually composed of a set of consecutive time intervals I = {I_1, I_2, ..., I_n}, where each interval has the same size as W_pdt, so W_obs is n times as long as W_pdt. In a prediction cycle, the observation window W_obs is used to collect evidence that determines whether a failure will occur within the prediction window W_pdt. Lead time is the time interval preceding the time of failure occurrence. To be practical, the lead time must be long enough to carry out the desired proactive fault-prevention action.

Fig. 1. Period-based approach

2) Event-driven approach: In an event-driven method, the triggering of a failure alarm is determined by events. Strictly speaking, the predictor needs to continuously keep track of every event occurrence until a failure alarm is raised. In practice, however, the event-driven approach still uses an observation window W_obs. There are two reasons for doing so.
First, it is impractical to keep track of every event occurring before a failure because of the sheer number of events that can occur in a large-scale system. Second, many studies have shown that events occurring long before a failure are less likely to be correlated with it. Hence, in an event-driven method, the predictor keeps sliding W_obs forward, and events outside W_obs are not considered. Figure 2 illustrates the main components of a prediction cycle in the event-driven approach: W_obs, W_lt, and Failure. Unlike the period-based approach, a predictor using the event-driven approach predicts whether or not a failure will occur right after W_lt.

B. Main Contributions

Both the event-driven and period-based approaches have great potential for fault management in large-scale systems. In this paper, we analyze and compare the impact of the observation window and lead time on both period-based and event-driven