Lessons learned from spatial and temporal correlation of node failures in high performance computers

Siavash Ghiasvand, Florina M. Ciorba*, Ronny Tschüter, and Wolfgang E. Nagel
Technische Universität Dresden, Dresden, Germany. Email: {firstname.lastname}@tu-dresden.de
*University of Basel, Basel, Switzerland. Email: florina.ciorba@unibas.ch

Abstract—In this paper we study the correlation of node failures in time and space. Our study is based on measurements of a production high performance computer over an 8-month time period. We draw possible types of correlations between node failures and show that, in many cases, there are direct correlations between observed node failures. The significance of such a study is twofold: achieving a clearer understanding of correlations between node failures and enabling failure detection as early as possible. The results of this study are aimed at helping the system administrators minimize (or even prevent) the destructive effects of correlated node failures.

I. INTRODUCTION

The failure rate of high performance computers rapidly increases due to their growth in size and complexity. Failures, thus, become the norm rather than the exception. There are several de-facto failure recovery mechanisms for high performance computers (e.g., checkpoint-restart, duplication, and re-execution). The efficiency of recovery mechanisms depends on the mean time between failures (MTBF). It is expected that in the near future, the MTBF of high performance computers becomes too short, such that current de-facto failure recovery mechanisms will no longer be able to recover the system from failures [1]. Early failure detection is a new class of failure recovery methods which can be beneficial for high performance computers with low MTBF. Detecting failures in their early stage can reduce their negative effects by preventing propagation of their side effects to other parts of the system [2].
One way to detect failures in their early stage is to monitor the nodes' behavior and seek behavioral anomalies. Performance is a key property of high performance computers. To avoid any performance penalty due to active probing of nodes, we employ a passive monitoring approach. The native Linux message logging facility (also referred to as syslog) is the source of our monitoring information. The syslog daemon records ongoing hardware- and software-related events of the system. In this paper, we refer to such hardware/software events in the syslog as syslog entries.

When studying failures, the granularity of components plays an important role in interpreting the system behavior. As long as a node behaves as expected, it can withstand failures. When a node is unable to carry out its expected load, this is viewed as a "node outage". Certain failures will lead to node outages (e.g., a failed switch). A node outage is an observable indication of a failure occurrence. A node outage is defined as the case when no syslog entries from a particular node can be observed. We consider three reasons for node outages: (1) site-wide power outages, (2) planned maintenance, and (3) other reasons.

In this study we focus on the failures observed at the node level and derive correlations along three dimensions: (1) Temporal: denotes cases when the time gap between consecutive failures falls below a certain threshold; (2) Spatial: denotes cases when the failed nodes share a physical resource (e.g., chassis); and (3) Logical: denotes cases when the failed nodes share a logical resource (e.g., batch job).

The main contribution of this paper is a detailed study that aims to help system administrators minimize the destructive effects of correlated node failures. Our methodology is based on a cyclic workflow, as follows: System monitoring → Analysis of monitoring data → Derivation of correlations → Early failure detection → Timely failure prediction.
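The temporal dimension above can be sketched as a simple threshold-based grouping of the failure timeline: consecutive failures whose inter-arrival gap falls below the threshold are placed in the same correlated group. The node names, timestamps, and the 5-minute threshold below are illustrative assumptions, not values taken from the measured system.

```python
from datetime import datetime, timedelta

# Hypothetical failure records (timestamp, node) as they might be
# extracted from syslog; all values here are illustrative.
failures = [
    (datetime(2016, 3, 1, 10, 0, 0), "node-001"),
    (datetime(2016, 3, 1, 10, 2, 0), "node-002"),
    (datetime(2016, 3, 1, 10, 3, 30), "node-003"),
    (datetime(2016, 3, 1, 14, 0, 0), "node-017"),
]

def temporal_clusters(failures, threshold=timedelta(minutes=5)):
    """Group time-sorted failures whose gap to the previous failure
    in the current group is below the threshold."""
    clusters = []
    for ts, node in sorted(failures):
        if clusters and ts - clusters[-1][-1][0] < threshold:
            clusters[-1].append((ts, node))   # within threshold: same group
        else:
            clusters.append([(ts, node)])     # gap too large: new group
    return clusters

clusters = temporal_clusters(failures)
# The first three failures lie within 5-minute gaps of one another and
# form one temporally correlated group; the fourth starts a new group.
```

The same grouping skeleton extends to the spatial and logical dimensions by replacing the time-gap predicate with a shared-resource check (e.g., same chassis, or same batch job).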
In this paper we concentrate on "derivation of correlations". The subsequent steps are part of future work.

II. RELATED WORK

Distributed computer systems are hierarchically structured, e.g., system, cluster, rack, and node. Furthermore, such systems share various resources, such as power supply units, network, or (distributed) file systems. As a result, failures in one location may trigger further failures and, as such, propagate between different system components. Failure correlations can be classified into temporal, spatial, logical, and combinations thereof.

Yigitbasi et al. [3] investigated the temporal correlation of failures in large-scale distributed systems. The examined event logs featured strong daily patterns and high auto-correlation. Sahoo et al. [4] analyzed event logs of heterogeneous servers and also found different forms of strong correlation structures including significant periodic behavior. A proactive failure prediction system considering the temporal order of the events was described by Sahoo et al. [5]. Liang et al. [6] analyzed logs from an IBM BlueGene/L system and found skewness in the distribution of network failures. Gallet et al. [7] used a moving window to generate groups of spatially correlated failures from empirical data. They showed that spatial correlation of failures cannot be neglected for an accurate analysis of system downtimes. Fu et al. [8] processed event logs to identify event dependencies in order to improve failure prediction and root cause diagnosis. Fu and Xu [9] enhanced this approach by taking into account both temporal and spatial correlations. Their model used event clustering to quantify the temporal correlation and developed another model to characterize spatial correlation. They showed that failure events exhibit strong correlations in