Poster: Multi-level Anomaly Prediction in Tier-0 Datacenter: a Deep Learning Approach

Mohsen Seyedkazemi Ardebili
mohsen.seyedkazemi@unibo.it
University of Bologna
Bologna, Italy

Andrea Bartolini
a.bartolini@unibo.it
University of Bologna
Bologna, Italy

Luca Benini
luca.benini@unibo.it, lbenini@iis.ee.ethz.ch
University of Bologna / ETH Zurich
Bologna, Italy / Zurich, Switzerland

ABSTRACT
Modern scientific discoveries are driven by an insatiable demand for computational resources. To solve large problems in science, engineering, and business, data centers provide High-Performance Computing (HPC) systems that aggregate the computing capacity of thousands of computing nodes. Anomaly prediction is critical to preserve the continuity of service of HPC systems and to prevent hardware deterioration. In the datacenter, a thermal anomaly occurs when the balance between cooling capacity and computational demand is disturbed. Such an anomaly is identifiable from a suspicious or abnormal pattern in the monitoring signals. In this poster, we investigate the anomaly prediction task in HPC systems by defining complex statistical rule-based and Deep Learning (DL)-based anomaly detection methods, and then using these anomaly detection methods in an anomaly prediction framework.

CCS CONCEPTS
• Hardware → Temperature monitoring.

KEYWORDS
Datacenter, HPC Systems, Deep Learning, Anomaly Prediction

ACM Reference Format:
Mohsen Seyedkazemi Ardebili, Andrea Bartolini, and Luca Benini. 2022. Poster: Multi-level Anomaly Prediction in Tier-0 Datacenter: a Deep Learning Approach. In 19th ACM International Conference on Computing Frontiers (CF’22), May 17–19, 2022, Torino, Italy. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3528416.3530864

1 INTRODUCTION
Each HPC cluster comprises thousands of computing nodes that may consume electrical power in the range of megawatts, and all of this electrical energy turns into heat.
Predicting thermal hazards in time is extremely important to avoid damage to IT and facility equipment and datacenter outages, which cause severe societal and business losses [Seyedkazemi Ardebili et al. 2021]. In the state of the art (SoA), thermal hazards have been studied with different methodologies: [Cho et al. 2009] proposed simulators, [Athavale et al. 2018] proposed Machine Learning (ML) approaches, [Wang et al. 2009] proposed mathematical models, and finally, [Tang et al. 2006] and [Seyedkazemi Ardebili et al. 2022] proposed combining sensors with a computer model to create the room’s heat map or thermal evolution model.

To the best of our knowledge, this is the first empirical study of anomaly detection and prediction techniques on a real large-scale HPC system. This study is based on real data from an in-production HPC cluster and HPC room facilities. The monitoring data is collected by a holistic monitoring system, namely ExaMon, one of the SoA HPC monitoring systems [Bartolini et al. 2019].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
CF’22, May 17–19, 2022, Torino, Italy
© 2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9338-6/22/05.
https://doi.org/10.1145/3528416.3530864

2 METHODOLOGY AND EXPERIMENTAL RESULTS
We conducted experiments on monitoring data from the Marconi100 HPC cluster of CINECA, a Tier-0 cluster with about 32 PFlop/s of computing capacity. It ranked 9th (June 2020 list) and 18th (November 2021 list) among the most powerful supercomputers worldwide.
CINECA is the most powerful supercomputing center for scientific research in Italy and one of the most powerful in the world [Top500 List 2022].

2.1 Thermal Anomaly Prediction
In [Seyedkazemi Ardebili et al. 2021], we introduced a rule-based statistical tool for thermal hazard detection, built on the statistical analysis of two real reported thermal emergencies. We use this tool to generate ground-truth binary thermal anomaly labels for the HPC room for the whole of 2019. We then propose a framework for thermal hazard prediction that encompasses data query and preprocessing, model training, and final model inference, which provides the prediction. We studied different classical machine learning and Deep Learning (DL) tools. Since the Temporal Convolutional Network (TCN) outperforms both the non-deep models and Long Short-Term Memory (LSTM), we selected the TCN model for further study. The proposed TCN shows a significant degradation in prediction performance (the F1-score drops from 0.98 to 0.74) when applied in a more realistic scenario, in which training is limited to recent past data.

We aim to improve the results with different strategies: (i) training on more historical data; (ii) adding input metrics, prioritizing power consumption; (iii) 2D and 3D convolutions; (iv) iterative retraining; and (v) a more advanced, DL-based anomaly detection approach for generating the thermal hazard labels, to simulate the real scenario even more accurately. To evaluate the first four strategies, we ran several experiments and introduced new data structures for the model input. In brief, the 4D input data structure with 3D convolutional layers in the TCN architecture reaches the highest prediction performance (an improvement of around 8% in F1-score, reaching 0.80). Augmenting the model’s input with other metrics, like power consumption, degraded