data
Article
OFCOD: On the Fly Clustering Based Outlier Detection Framework †
Ahmed Elmogy 1,2,*,‡, Hamada Rizk 2,‡ and Amany M. Sarhan 2,‡
Citation: Elmogy, A.; Rizk, H.; Sarhan, A.M. OFCOD: On the Fly Clustering Based Outlier Detection Framework. Data 2021, 6, 1. https://dx.doi.org/10.3390/data6010001
Received: 31 October 2020
Accepted: 24 December 2020
Published: 30 December 2020
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
1 Computer Engineering Department, Prince Sattam Bin Abdulaziz University, Al-Kharj 16278, Saudi Arabia
2 Computers and Control Engineering Department, Faculty of Engineering, Tanta University, Tanta 31733, Egypt; hamada.rizk@ejust.edu.eg (H.R.); amany_sarhan@f-eng.tanta.edu.eg (A.M.S.)
* Correspondence: a.elmogy@psau.edu.sa
† This paper is an extended version of "A hybrid outlier detection algorithm based on partitioning clustering
and density measures", published in the 2015 Tenth International Conference on Computer Engineering &
Systems (ICCES), 23–24 December 2015.
‡ These authors contributed equally to this work.
Abstract: In data mining, outlier detection is a major challenge, as it plays an important role in
many applications such as medical data, image processing, fraud detection, and intrusion detection.
A wide variety of clustering-based approaches have been developed to detect outliers. However,
they are by nature time-consuming, which restricts their use in real-time applications. Furthermore,
outlier detection requests are handled one at a time, which means that each request is initiated
individually with a particular set of parameters. In this paper, the first clustering-based outlier
detection framework, On the Fly Clustering Based Outlier Detection (OFCOD), is presented. OFCOD
enables analysts to find outliers effectively, on request and in a timely manner, even within huge
datasets. The proposed framework has been tested and evaluated using two real-world datasets with
different features and applications: one with 699 records, and another with five million records.
The experimental results show that the proposed framework outperforms existing approaches across
several evaluation metrics.
Keywords: clustering; outlier detection; outlierness factor; similarity measure
1. Introduction
An outlier is a data point that differs significantly from the other points in the
dataset [1]. Outliers degrade the performance of data analysis algorithms and lead to
misleading results [2]. Thus, detecting outliers is an important step in data analysis
and data mining. Outlier detection algorithms detect rare or abnormal behavior, which can be
considered more important than normal behavior in many applications such as cancer diagnosis,
product defect detection, fraudulent credit card transactions, and hacking in network traffic data.
Outlier detection algorithms are also used as a preprocessing step for data mining
algorithms to filter outliers out of datasets [3].
Outlier detection has been studied extensively over the past fifteen years.
Many algorithms with different approaches have been introduced in the literature [4–11].
These can generally be categorized [12–14] into statistical-based [15,16], distance-based [17,18],
density-based [19,20], and clustering-based methods [9,21–23]. Statistical-based approaches aim
at finding the probability distribution/model of the underlying normal data and define
outliers as those points that do not conform to that model. However, a single distribution
may not model the entire dataset, which may originate from multiple unknown distributions;
this limits the practical adoption of such approaches, especially for high-dimensional data.
Distance-based approaches define outliers as the points that are located far away from the
majority of points under a distance metric such as the Manhattan or Euclidean distance.
These approaches fail when the data points have different spatial densities (sparsity),
in which case defining a single outlierness distance is not practically feasible [24–26]. To combat