I.J. Intelligent Systems and Applications, 2021, 1, 58-68
Published Online February 2021 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijisa.2021.01.05
Copyright © 2021 MECS
Genetic-based Summarization for Local Outlier
Detection in Data Stream
Mohamed Sakr, Walid Atwa and Arabi Keshk
Computer Science Dept. Faculty of Computers and Information, Menoufia University, Egypt
E-mail: mssakr@ymail.com, walid.atwa@ci.menofia.edu.eg, arabikeshk@yahoo.com
Received: 03 February 2020; Accepted: 16 March 2020; Published: 08 February 2021
Abstract: Outlier detection is one of the important tasks in data mining. Detecting outliers over streaming data has
become an important task in many applications, such as network analysis, fraud detection, and environmental monitoring.
One of the well-known outlier detection algorithms is the Local Outlier Factor (LOF). However, the original LOF has
drawbacks that prevent its use with data streams: (1) it needs a lot of processing power (CPU) and a large amount of
memory to detect outliers; (2) it deals with static data, which means that after any change in the data LOF recalculates
the outliers from scratch over the whole data set. These drawbacks pose major challenges for existing outlier detection
algorithms in terms of their accuracy when they are implemented in a streaming environment. In this paper, we propose a
new algorithm called GSILOF that focuses on detecting outliers in data streams using a genetic algorithm. GSILOF solves
the problem of large memory requirements, as it has a fixed memory bound. GSILOF has two phases. First, the
summarization phase summarizes the data that has already arrived. Second, the detection phase detects outliers in the
newly arriving data. The summarization phase uses a genetic algorithm to find a subset of points that can represent the
whole original set. Our experiments on real-world streaming data sets confirm the effectiveness of the proposed approach
and the high quality of its approximate solutions.
Index Terms: Outlier detection, data streams, local outlier factor, genetics.
1. Introduction
Outlier detection, also known as anomaly detection, has gained a lot of importance and attention in the field of
data mining. It has been used in many applications such as credit card fraud detection and intrusion detection in web
apps. Many algorithms have been developed to detect outliers in static data, in which the number of points is
fixed and does not change over time. However, detecting outliers in streamed data is difficult because the data set is
unbounded in size and changes over time, and thus cannot be stored in memory in full for processing [1].
Density-based techniques are one family of techniques used in outlier detection. They are well suited to detecting
outliers in regions of different density and to dealing with data sets of nonhomogeneous density.
One of the well-known density-based algorithms for outlier detection is the Local Outlier Factor (LOF). LOF has
been used on data sets with heterogeneous densities [2, 3, 4]. However, LOF deals with static data that does not change
over time, as its calculations are done once over the whole data set. Because LOF performs its calculations once over
the whole data set, it needs a huge amount of memory to store the data to be processed. Specifically, LOF has O(n^2)
space complexity, as it stores all the points of the data and the distances between all pairs of points. In addition,
after any change in the data, such as adding or deleting points, LOF must be recalculated over the whole data set.
Because of these limitations, LOF cannot be used with data streams, whose size is unbounded and whose data changes over
time as new points arrive [5].
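
To make the memory cost concrete, the following is a minimal brute-force sketch of static LOF in Python. It is an
illustration only, not the implementation used in this paper. The function name lof_scores is hypothetical, the
sketch takes exactly k neighbours per point (the original LOF definition keeps every point tied at the k-distance),
and it assumes the data contains no more than k duplicates of any point, so that no density is infinite.

```python
import math

def lof_scores(points, k):
    """Return the LOF score of every point (brute force, static data)."""
    n = len(points)
    # Full pairwise distance matrix: this is the O(n^2) memory cost.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Exactly k nearest neighbours of each point, excluding the point itself.
    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Four clustered points and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = lof_scores(data, k=2)
```

Scores close to 1 indicate a point whose density matches that of its neighbours, while scores well above 1 indicate
an outlier. Adding or deleting a single point invalidates the distance matrix, the neighbour lists, and the
densities, so the whole function has to be run again.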
A data stream is a continuous sequence of data records ordered by timestamp, of which only part is available at any
given point in time. Thus, when working on applications with streaming data, the temporal context needs to be
considered. In addition, the processing places additional requirements on computational and memory resources. Many
applications perform outlier detection over streaming data, such as network analysis, fraud detection, and
environmental monitoring. Thus, we need to find abnormal data over data streams in real time.
Researchers have proposed different solutions to this problem. One of those solutions applies a sliding window
and performs learning only on the data inside the window. This solution performs well in some
applications and also produces results in real time. However, the correctness of its results depends largely on the
window size, which is not taken into account. Other solutions exist, but most of them fail to address these properties
of streaming data and thus produce results with poor accuracy [6].
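
The sliding-window idea can be sketched as follows. This is a hedged illustration, not a method from the cited
work: the function name sliding_window_outliers, the threshold parameter, and the use of the mean distance to the k
nearest points in the window as the score are all assumptions chosen to keep the example short (a window-based LOF
would replace that score).

```python
from collections import deque
import math

def sliding_window_outliers(stream, window_size, k, threshold):
    """Yield (point, is_outlier), scoring each point against the current window only."""
    window = deque(maxlen=window_size)  # the oldest point is evicted automatically
    for point in stream:
        if len(window) >= k:
            nearest = sorted(math.dist(point, q) for q in window)[:k]
            score = sum(nearest) / k  # stand-in score: mean k-NN distance
            yield point, score > threshold
        else:
            yield point, False  # not enough history to judge yet
        window.append(point)

stream = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (0.5, 0.5)]
flags = [is_out for _, is_out in
         sliding_window_outliers(stream, window_size=4, k=2, threshold=3.0)]
```

Memory is bounded by window_size, but every decision rests only on the points currently in the window. A window
that is too small forgets the normal behaviour of the stream, and one that is too large brings back the cost of
the static case, which is why the result depends so strongly on this parameter.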
In this paper, we propose a new algorithm called GSILOF (Genetic Summarizing Incremental LOF) that
overcomes the aforementioned challenges of streaming data. The GSILOF algorithm consists of two phases: 1) detection