computers

Article

The Use of Template Miners and Encryption in Log Message Compression

Péter Marjai 1, Péter Lehotay-Kéry 1 and Attila Kiss 1,2,*
Citation: Marjai, P.; Lehotay-Kéry, P.; Kiss, A. The Use of Template Miners and Encryption in Log Message Compression. Computers 2021, 10, 83. https://doi.org/10.3390/computers10070083
Academic Editor: George Angelos Papadopoulos
Received: 31 May 2021
Accepted: 20 June 2021
Published: 23 June 2021
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
1 Department of Information Systems, ELTE Eötvös Loránd University, 1117 Budapest, Hungary; g7tzap@inf.elte.hu (P.M.); lkp@caesar.elte.hu (P.L.-K.)
2 Department of Informatics, J. Selye University, 94501 Komárno, Slovakia
* Correspondence: kiss@inf.elte.hu
Abstract: Presently, almost all computer software produces many log messages based on events
and activities during the usage of the software. These files contain valuable runtime information that
can be used in a variety of applications such as anomaly detection, error prediction, template mining,
and so on. Usually, the generated log messages are raw, which means they have an unstructured
format. This means that these messages have to be parsed before data mining models can be applied. After parsing, template miners can be run on the data to retrieve the events occurring in the log file. Each event consists of two parts: the template, which is fixed and identical for all instances of the same event type, and the parameter part, which varies between instances.
To decrease the size of the log messages, we use the mined templates to build a dictionary for the
events, and only store the dictionary, the event ID, and the parameter list. We use six template miners to acquire the templates, namely IPLoM, LenMa, LogMine, Spell, Drain, and MoLFI. In this paper,
we evaluate the compression capacity of our dictionary method with the use of these algorithms.
Since parameters could be sensitive information, we also encrypt the files after compression and
measure the changes in file size. We also examine the speed of the template miner algorithms. Based on our experiments, LenMa has the best compression rate, with an average of 67.4%; however, because of its high runtime, we suggest combining our dictionary method with IPLoM and FFX instead, since IPLoM is the fastest of all the methods and still achieves a 57.7% compression rate.
Keywords: log file processing; template mining; compression; encryption
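The dictionary-based compression idea summarized in the abstract can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the templates, the `<*>` placeholder convention, and the sample messages are assumptions for demonstration, while in practice the templates would come from one of the miners named above, such as IPLoM or Drain.

```python
# Sketch of dictionary-based log compression: each message is reduced to
# an event ID plus its parameter list; the template dictionary is stored once.
import re

# Hypothetical mined templates; <*> marks a parameter position.
templates = {
    1: "Connection from <*> closed after <*> ms",
    2: "Temperature reading: <*> C on sensor <*>",
}

def template_to_regex(template):
    """Turn a template into a regex that captures its parameters."""
    parts = [re.escape(p) for p in template.split("<*>")]
    return re.compile("^" + "(.+?)".join(parts) + "$")

patterns = {eid: template_to_regex(t) for eid, t in templates.items()}

def compress(message):
    """Return (event_id, [parameters]) instead of the raw message."""
    for eid, pattern in patterns.items():
        m = pattern.match(message)
        if m:
            return eid, list(m.groups())
    return None, [message]  # unmatched messages are kept verbatim

def decompress(eid, params):
    """Rebuild the original message from a dictionary entry."""
    msg = templates[eid]
    for p in params:
        msg = msg.replace("<*>", p, 1)
    return msg

eid, params = compress("Connection from 10.0.0.5 closed after 120 ms")
assert (eid, params) == (1, ["10.0.0.5", "120"])
assert decompress(eid, params) == "Connection from 10.0.0.5 closed after 120 ms"
```

Since only the short (event ID, parameters) pairs are stored per message, the savings grow with the number of messages sharing each template; the parameter lists are also what would be encrypted in the scheme described above.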
1. Introduction
Creating logs is a common practice in programming, used to store runtime information about a software system. It is carried out by developers, who insert logging statements into the source code of their applications. Since log files contain all the important
information, they can be used for numerous purposes, such as outlier detection [1,2],
performance monitoring [3,4], fault localization [5], office tracking [6], business model
mining [7], or reliability engineering [8].
Outlier detection (also known as anomaly detection) is done by detecting unusual
log messages that differ significantly from the rest of the messages, thus raising suspicion.
These messages can be used to pinpoint the cause of the problem such as errors in a
text, structural defects, or network intrusion. For example, a log message with high temperature values could indicate a malfunctioning ventilator. The authors of “Anomaly
Detection from Log Files Using Data Mining Techniques” [1] proposed an anomaly-based
approach using data mining of logs, and the overall error rates of their method were below
10%. There are three main types of anomaly detection methods: supervised, such as K-Means+ID3 [9]; unsupervised, such as DeepAnT [10]; and semi-supervised, such as GANomaly [11].
Supervised techniques work based on data sets that have been labeled “normal” and
“abnormal”. Unsupervised algorithms use unlabeled datasets. Semi-supervised detection
creates a model that represents normal behavior [2].
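As a toy illustration of the unsupervised flavor (our sketch, not taken from the cited works), numeric values extracted from log messages, such as the temperature readings mentioned above, can be flagged when they deviate strongly from the rest. The data and the z-score threshold below are illustrative assumptions.

```python
# Minimal unsupervised outlier check on values extracted from log messages:
# flag readings whose z-score exceeds a threshold. No labels are needed.
from statistics import mean, stdev

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations
    from the mean of the sample."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Hypothetical temperatures parsed from log messages; one reading stands out
# and could indicate a malfunctioning ventilator.
temps = [41, 42, 40, 43, 41, 42, 95, 40, 41]
print(z_score_outliers(temps))  # the 95 C reading is flagged
```

Real systems such as DeepAnT model far richer structure (sequences, seasonality), but the principle is the same: no labeled data, only deviation from learned normal behavior.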