Article
OFCOD: On the Fly Clustering Based Outlier Detection Framework †

Ahmed Elmogy 1,2,*,‡, Hamada Rizk 2,‡ and Amany M. Sarhan 2,‡

1 Computer Engineering Department, Prince Sattam Bin Abdulaziz University, Al-Kharj 16278, Saudi Arabia
2 Computers and Control Engineering Department, Faculty of Engineering, Tanta University, Tanta 31733, Egypt; hamada.rizk@ejust.edu.eg (H.R.); amany_sarhan@f-eng.tanta.edu.eg (A.M.S.)
* Correspondence: a.elmogy@psau.edu.sa
† This paper is an extended version of "A hybrid outlier detection algorithm based on partitioning clustering and density measures", published in: 2015 Tenth International Conference on Computer Engineering & Systems (ICCES), 23–24 December 2015.
‡ These authors contributed equally to this work.

Citation: Elmogy, A.; Rizk, H.; Sarhan, A.M. OFCOD: On the Fly Clustering Based Outlier Detection Framework. Data 2021, 6, 1. https://dx.doi.org/10.3390/data6010001
Received: 31 October 2020; Accepted: 24 December 2020; Published: 30 December 2020
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Abstract: In data mining, outlier detection is a major challenge, as it plays an important role in many applications such as medical data analysis, image processing, fraud detection, intrusion detection, and so forth. A wide variety of clustering-based approaches have been developed to detect outliers. However, they are by nature time-consuming, which restricts their use in real-time applications. Furthermore, outlier detection requests are handled one at a time, which means that each request is initiated individually with a particular set of parameters.
In this paper, we present On the Fly Clustering Based Outlier Detection (OFCOD), the first clustering-based outlier detection framework. OFCOD enables analysts to find outliers promptly, on request, even within huge datasets. The proposed framework has been tested and evaluated using two real-world datasets with different features and applications: one with 699 records and another with five million records. The experimental results show that the proposed framework outperforms existing approaches across several evaluation metrics.

Keywords: clustering; outlier detection; outlierness factor; similarity measure

1. Introduction

An outlier is a data point that differs significantly from the other points in the dataset [1]. Outliers degrade the performance of data analysis algorithms and lead to misleading results [2]. Thus, detecting outliers is an important step in data analysis and data mining. Outlier detection algorithms detect rare or abnormal behavior, which in many applications can be considered more important than normal behavior, such as cancer diagnosis, product defect detection, fraudulent credit card transactions, and hacking attempts in network traffic data. Outlier detection algorithms are also used as a preprocessing step for data mining algorithms, to filter outliers out of datasets [3].

Outlier detection has been studied extensively over the past fifteen years. Many algorithms with different approaches have been introduced in the literature [4–11]. In general, they can be categorized into [12–14]: statistical-based [15,16], distance-based [17,18], density-based [19,20], and clustering-based methods [9,21–23].

Statistical-based approaches aim at finding the probability distribution/model of the underlying normal data and define outliers as those points that do not conform to that model.
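As a minimal sketch of the statistical-based idea, assuming the normal data follow a single one-dimensional Gaussian, a point can be flagged when its z-score under the fitted model exceeds a threshold. The function name `gaussian_outliers`, the threshold of 3, and the sample values are illustrative assumptions, not taken from the paper or from OFCOD.

```python
import statistics


def gaussian_outliers(values, z_threshold=3.0):
    """Flag values whose z-score under a fitted normal model exceeds z_threshold.

    Hypothetical helper for illustration; not part of OFCOD.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing deviates from the model
    return [v for v in values if abs(v - mean) / std > z_threshold]


# Made-up one-dimensional sample: values near 10 plus one extreme value.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9, 10.0, 25.0]
flagged = gaussian_outliers(sample)
print(flagged)
```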
However, a single distribution may not model the entire dataset, which may originate from multiple unknown distributions; this limits the practical adoption of such approaches, especially for high-dimensional data.

Distance-based approaches define outliers as the points that lie far away from the majority of points under a distance metric such as the Manhattan or Euclidean distance. These approaches fail when the data points have different spatial densities (sparsity), in which case defining an outlierness distance is not practically feasible [24–26]. To combat