Poster: Multi-level Anomaly Prediction in Tier-0 Datacenter:
a Deep Learning Approach
Mohsen Seyedkazemi Ardebili
mohsen.seyedkazemi@unibo.it
University of Bologna
Bologna, Italy
Andrea Bartolini
a.bartolini@unibo.it
University of Bologna
Bologna, Italy
Luca Benini
luca.benini@unibo.it, lbenini@iis.ee.ethz.ch
University of Bologna / ETH Zurich
Bologna, Italy / Zurich, Switzerland
ABSTRACT
Modern scientific discoveries are driven by an insatiable demand
for computational resources. To solve large problems in science,
engineering, and business, datacenters provide High-Performance
Computing (HPC) systems that aggregate the computing capacity
of thousands of computing nodes. Anomaly prediction is critical
to preserve the continuity of service of HPC systems and to prevent
hardware deterioration. In the datacenter, a thermal anomaly occurs
when the balance between cooling capacity and computational demand
is disturbed. Such an anomaly is identifiable from a suspicious or
abnormal pattern in the monitoring signals.
In this poster, the anomaly prediction task in HPC systems is
investigated by defining complex statistical rule-based and Deep
Learning (DL)-based anomaly detection methods, then utilizing these
anomaly detection methods in an anomaly prediction framework.
CCS CONCEPTS
· Hardware → Temperature monitoring.
KEYWORDS
Datacenter, HPC Systems, Deep Learning, Anomaly Prediction
ACM Reference Format:
Mohsen Seyedkazemi Ardebili, Andrea Bartolini, and Luca Benini. 2022.
Poster: Multi-level Anomaly Prediction in Tier-0 Datacenter: a Deep Learning
Approach. In 19th ACM International Conference on Computing Frontiers
(CF’22), May 17–19, 2022, Torino, Italy. ACM, New York, NY, USA, 2 pages.
https://doi.org/10.1145/3528416.3530864
1 INTRODUCTION
Each HPC cluster comprises thousands of computing nodes that
may consume electrical power in the range of megawatts, and
all of this electrical energy turns into heat. Predicting thermal
hazards in time is extremely important to avoid damage to IT and
facility equipment and datacenter outages, with severe societal
and business losses [Seyedkazemi Ardebili et al. 2021]. In the
state of the art (SoA), thermal hazards have been studied with different
methodologies: [Cho et al. 2009] proposed to use simulators, [Athavale
et al. 2018] proposed Machine Learning (ML) approaches, [Wang
et al. 2009] proposed mathematical models, and finally, [Tang et al.
2006] and [Seyedkazemi Ardebili et al. 2022] proposed to use sensors
with a computer model to create the room’s heat map or thermal
evolution model.
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).
CF’22, May 17–19, 2022, Torino, Italy
© 2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9338-6/22/05.
https://doi.org/10.1145/3528416.3530864
To the best of our knowledge, this is the first empirical study of
anomaly detection and prediction techniques on a real large-scale HPC
system. This study is based on real data from an in-production
HPC cluster and HPC room facilities. The monitoring data is collected
by a holistic monitoring system, namely ExaMon,
one of the SoA HPC monitoring systems [Bartolini et al. 2019].
2 METHODOLOGY AND EXPERIMENTAL
RESULTS
We ran experiments on monitoring data from the Marconi100 HPC
cluster at CINECA, a Tier-0 cluster with about 32 PFlop/s of computing
capacity. It ranked 9th (June 2020) and 18th (November 2021) in the
list of the most powerful supercomputers worldwide.
CINECA is the most powerful supercomputing center for scientific
research in Italy and hosts one of the most powerful supercomputers in
the world [Top500 List 2022].
2.1 Thermal Anomaly Prediction
In [Seyedkazemi Ardebili et al. 2021], we introduced a rule-based
statistical tool for thermal hazard detection based on the statistical
analysis of two real reported thermal emergencies. This tool is
adopted to generate ground-truth binary thermal anomaly labels
for the HPC room for the whole of 2019. Then, a framework for
thermal hazard prediction is proposed, which encompasses data
query and preprocessing, model training, and final model inference,
which provides the prediction. We studied different classical machine
learning and Deep Learning (DL) tools. Since the Temporal
Convolutional Network (TCN) outperforms both the non-deep models and
Long Short-Term Memory (LSTM), we selected the TCN model for further
study. The proposed TCN shows significant performance degradation
in prediction (F1-score drops from 0.98 to 0.74) when applied
in a more realistic scenario (training limited to recent past data).
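As an illustration of the preprocessing and evaluation steps in such a framework, the following is a minimal sketch, not the implementation used in this work. It pairs each window of past sensor readings with the binary anomaly label a fixed number of steps ahead, splits the windows chronologically so that training is limited to the recent past, and computes the F1-score. The function names (make_windows, recent_past_split, f1_score) and the parameters window, horizon, train_size, and test_size are hypothetical; the poster does not specify window length or prediction horizon.

```python
from typing import List, Sequence, Tuple


def make_windows(
    signal: Sequence[float],
    labels: Sequence[int],
    window: int,
    horizon: int,
) -> Tuple[List[List[float]], List[int]]:
    """Pair each window of past readings with the label `horizon` steps ahead.

    `window` and `horizon` are illustrative assumptions, not values from the poster.
    """
    xs: List[List[float]] = []
    ys: List[int] = []
    for end in range(window, len(signal) - horizon + 1):
        xs.append(list(signal[end - window:end]))  # readings end-window .. end-1
        ys.append(labels[end + horizon - 1])       # label `horizon` steps later
    return xs, ys


def recent_past_split(xs: list, ys: list, train_size: int, test_size: int):
    """Chronological split: train only on the windows right before the test block."""
    test_start = len(xs) - test_size
    train_start = max(0, test_start - train_size)
    return (
        xs[train_start:test_start],
        ys[train_start:test_start],
        xs[test_start:],
        ys[test_start:],
    )


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = harmonic mean of precision and recall for the anomaly class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (made up for illustration): room temperature and rule-based labels.
temps = [20.0, 20.5, 21.0, 24.0, 27.0, 30.0, 29.0, 25.0, 21.0, 20.0]
anomaly = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]

xs, ys = make_windows(temps, anomaly, window=3, horizon=2)
x_train, y_train, x_test, y_test = recent_past_split(xs, ys, train_size=3, test_size=2)
```

A trained model would then be fitted on x_train and y_train and scored with f1_score on its predictions for x_test. Restricting train_size mimics the "recent past" scenario, in which the model sees less history than in a random split over the whole year.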
We aim to improve the results through different strategies: (i) training
on more historical data; (ii) adding input metrics, prioritizing
power consumption; (iii) 2D and 3D convolutions; (iv) iterative
retraining; and (v) a more advanced anomaly detection approach
employing DL to generate the thermal hazard label, to simulate
the real scenario even more accurately. To improve the model with
the first four strategies, we ran different experiments and introduced
some new approaches for structuring the input data. In brief, the 4D
input data structure with 3D convolutional layers in the TCN architecture
reaches the highest prediction performance (an improvement of around 8%
in F1-score, reaching 0.80). Adding other metrics, such as power
consumption, to the model’s input degraded