computers

Article

The Use of Template Miners and Encryption in Log Message Compression

Péter Marjai 1, Péter Lehotay-Kéry 1 and Attila Kiss 1,2,*

1 Department of Information Systems, ELTE Eötvös Loránd University, 1117 Budapest, Hungary; g7tzap@inf.elte.hu (P.M.); lkp@caesar.elte.hu (P.L.-K.)
2 Department of Informatics, J. Selye University, 94501 Komárno, Slovakia
* Correspondence: kiss@inf.elte.hu

Citation: Marjai, P.; Lehotay-Kéry, P.; Kiss, A. The Use of Template Miners and Encryption in Log Message Compression. Computers 2021, 10, 83. https://doi.org/10.3390/computers10070083

Academic Editor: George Angelos Papadopoulos
Received: 31 May 2021; Accepted: 20 June 2021; Published: 23 June 2021

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Abstract: Presently, almost every piece of computer software produces many log messages based on events and activities during its use. These files contain valuable runtime information that can be used in a variety of applications, such as anomaly detection, error prediction, and template mining. Usually, the generated log messages are raw, meaning they have an unstructured format; they therefore have to be parsed before data mining models can be applied. After parsing, template miners can be applied to the data to retrieve the events occurring in the log file. Each event consists of two parts: the template, the fixed part that is the same for all instances of the same event type, and the parameters, which vary across instances.
To decrease the size of the log messages, we use the mined templates to build a dictionary of the events, and we store only the dictionary, the event IDs, and the parameter lists. We use six template miners to acquire the templates, namely IPLoM, LenMa, LogMine, Spell, Drain, and MoLFI. In this paper, we evaluate the compression capacity of our dictionary method combined with each of these algorithms. Since parameters can contain sensitive information, we also encrypt the files after compression and measure the resulting changes in file size. We also examine the speed of the template miner algorithms. Based on our experiments, LenMa has the best compression rate, with an average of 67.4%; however, because of its high runtime, we suggest combining our dictionary method with IPLoM and FFX, since IPLoM is the fastest of all the methods and achieves a 57.7% compression rate.

Keywords: log file processing; template mining; compression; encryption

1. Introduction

Creating logs is a common practice in programming, used to store runtime information about a software system. Developers insert logging statements into the source code of their applications. Since log files contain all the important runtime information, they can be used for numerous purposes, such as outlier detection [1,2], performance monitoring [3,4], fault localization [5], office tracking [6], business model mining [7], or reliability engineering [8].

Outlier detection (also known as anomaly detection) works by detecting unusual log messages that differ significantly from the rest of the messages, thus raising suspicion. These messages can be used to pinpoint the cause of a problem, such as an error in a text, a structural defect, or a network intrusion. For example, a log message with high temperature values could indicate a malfunctioning ventilator.
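The dictionary-based idea described in the abstract can be illustrated with a short Python sketch. This is a minimal illustration only, not the authors' implementation: the `<*>` placeholder syntax for parameters and the helper names are our assumptions.

```python
# Minimal sketch of dictionary-based log compression: each message is
# replaced by an (event ID, parameter list) pair, and the templates are
# stored once in a dictionary. Placeholder syntax "<*>" is an assumption.
import re

def compress(messages, templates):
    """Map each message to (event_id, parameters) via the template dictionary."""
    patterns = []
    for tid, tpl in enumerate(templates):
        # Turn "Connection from <*> closed" into a regex capturing each parameter.
        regex = "^" + re.escape(tpl).replace(re.escape("<*>"), "(.+?)") + "$"
        patterns.append((tid, re.compile(regex)))
    encoded = []
    for msg in messages:
        for tid, pat in patterns:
            m = pat.match(msg)
            if m:
                encoded.append((tid, list(m.groups())))
                break
        else:
            encoded.append((None, [msg]))  # no template matched; keep message raw
    return encoded

def decompress(encoded, templates):
    """Rebuild the original messages from event IDs and parameter lists."""
    out = []
    for tid, params in encoded:
        if tid is None:
            out.append(params[0])
        else:
            msg = templates[tid]
            for p in params:
                msg = msg.replace("<*>", p, 1)  # fill placeholders left to right
            out.append(msg)
    return out

templates = ["Connection from <*> closed", "User <*> logged in from <*>"]
logs = ["Connection from 10.0.0.5 closed", "User alice logged in from 10.0.0.7"]
enc = compress(logs, templates)
assert enc == [(0, ["10.0.0.5"]), (1, ["alice", "10.0.0.7"])]
assert decompress(enc, templates) == logs
```

The size gain comes from storing each template string once in the dictionary while every occurrence of the event is reduced to an integer ID plus its parameters; the encrypted variant in the paper would then encrypt this compressed representation.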
The authors of “Anomaly Detection from Log Files Using Data Mining Techniques” [1] proposed an anomaly-based approach using data mining of logs, and the overall error rates of their method were below 10%. Anomaly detection methods fall into three main types: supervised, e.g., K-Means+ID3 [9]; unsupervised, e.g., DeepAnT [10]; and semi-supervised, e.g., GANomaly [11]. Supervised techniques work on data sets whose records have been labeled “normal” or “abnormal”. Unsupervised algorithms use unlabeled data sets. Semi-supervised detection builds a model that represents normal behavior [2].