Outlier Detection with One-Class Classifiers from ML and KDD

Jeroen H.M. Janssens, Ildiko Flesch, and Eric O. Postma
Tilburg centre for Creative Computing, Tilburg University, Tilburg, The Netherlands
Email: jeroen@jeroenjanssens.com, ildiko.flesch@gmail.com, eric.postma@gmail.com

Abstract—The problem of outlier detection is well studied in the fields of Machine Learning (ML) and Knowledge Discovery in Databases (KDD). Both fields have their own methods and evaluation procedures. In ML, Support Vector Machines and Parzen Windows are well-known methods that can be used for outlier detection. In KDD, the heuristic local-density estimation methods LOF and LOCI are generally considered to be superior outlier-detection methods. Hitherto, the performances of these ML and KDD methods have not been compared. This paper formalizes LOF and LOCI in the ML framework of one-class classification and performs a comparative evaluation of the ML and KDD outlier-detection methods on real-world datasets. Experimental results show that LOF and SVDD are the two best-performing methods. It is concluded that both fields offer outlier-detection methods that are competitive in performance and that bridging the gap between both fields may facilitate the development of outlier-detection methods.

Keywords—one-class classification; outlier detection; local density estimation

I. INTRODUCTION

There is a growing interest in the automatic detection of abnormal or suspicious patterns in large data volumes to detect terrorist activity, illegal financial transactions, or potentially dangerous situations in industrial processes. The interest is reflected in the development and evaluation of outlier-detection methods [1], [2], [3], [4]. In recent years, outlier-detection methods have been proposed in two related fields: Knowledge Discovery (in Databases) (KDD) and Machine Learning (ML).
Although both fields have considerable overlap in their objectives and subject of study, there appears to be some separation in the study of outlier-detection methods. In the KDD field, the Local Outlier Factor (LOF) method [3] and the Local Correlation Integral (LOCI) method [4] are the two main methods for outlier detection. Like most methods from KDD, LOF and LOCI are targeted at processing large volumes of data [5]. In the ML field, outlier detection is generally based on data description methods inspired by k-Nearest Neighbors (KNNDD), Parzen Windows (PWDD), and Support Vector Machines (SVDD), where DD stands for data description [1], [2]. These methods originate from statistics and pattern recognition, and have a solid theoretical foundation [6], [7]. Interestingly, within both fields the evaluation of outlier-detection methods occurs in relative isolation from the other field. In the KDD field, LOF and LOCI are rarely compared to ML methods such as KNNDD, PWDD, and SVDD [3], [4], and in the ML field, LOF and LOCI are seldom mentioned. As a case in point, in Hodge and Austin's review of outlier-detection methods [8], LOF and LOCI are not mentioned at all, while in a recent anomaly-detection survey [9], these methods are compared on a conceptual level only. The aim of this paper is to treat outlier-detection methods from both fields on an equal footing by framing them in a common methodological framework and by performing a comparative evaluation. To the best of our knowledge, this is the first time that outlier-detection methods from the fields of KDD and ML are evaluated and compared in a statistically valid way.¹ To this end we adopt the one-class classification framework [1]. The framework allows outlier-detection methods to be evaluated using the well-known performance measure AUC [11], and to be compared using statistically founded comparison tests such as the Friedman test [12] and the post-hoc Nemenyi test [13].
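To make the evaluation setup concrete, the following is a minimal sketch, not taken from the paper itself, of how a one-class classifier's outlier scores can be evaluated with the AUC. It uses a KNNDD-style score (the distance to the k-th nearest training point) and computes the AUC via the rank-based Mann-Whitney formulation; all function names and the toy data are illustrative assumptions.

```python
import math

def knn_outlier_score(train, x, k=3):
    """KNNDD-style measure of outlierness: the distance from x to its
    k-th nearest neighbor in the (inlier-only) training set.
    Larger values indicate greater outlierness."""
    dists = sorted(math.dist(x, t) for t in train)
    return dists[k - 1]

def auc(scores_inliers, scores_outliers):
    """AUC via the Mann-Whitney formulation: the probability that a
    randomly chosen outlier receives a higher score than a randomly
    chosen inlier (ties count as one half)."""
    wins = 0.0
    for s_out in scores_outliers:
        for s_in in scores_inliers:
            if s_out > s_in:
                wins += 1.0
            elif s_out == s_in:
                wins += 0.5
    return wins / (len(scores_inliers) * len(scores_outliers))

# Toy data: a tight 2-D cluster of inliers and two far-away outliers.
train = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
inlier_test = [(0.05, 0.0), (0.0, 0.05)]
outlier_test = [(5.0, 5.0), (-4.0, 3.0)]

s_in = [knn_outlier_score(train, x) for x in inlier_test]
s_out = [knn_outlier_score(train, x) for x in outlier_test]
print(auc(s_in, s_out))  # 1.0: every outlier scores above every inlier
```

Note that the classifier is fit on inliers only and scored on an independent test set containing both classes; an AUC of 0.5 would indicate chance-level separation.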
The outlier-detection methods whose performances are compared are LOF and LOCI from the field of KDD, and KNNDD, PWDD, and SVDD from the field of ML. In this paper, LOF and LOCI are reformulated in terms of the one-class classification framework. The ML methods have been proposed in terms of the one-class classification framework by De Ridder et al. [14] and Tax [1].

The remainder of the paper is organized as follows. Section II briefly presents the one-class classification framework. In Sections III and IV we introduce the KDD and ML outlier-detection methods, respectively, and explain how they compute a measure of outlierness. We describe the set-up of our experiments in Section V and their results in Section VI. Section VII discusses the results in terms of three observations. Finally, Section VIII concludes by stating that the fields of KDD and ML have outlier-detection methods that are competitive in performance and deserve treatment on an equal footing.

¹ Hido et al. recently compared LOF, SVDD, and several other outlier-detection methods [10]. Unfortunately, their study is flawed because no independent test set was used and no proper evaluation procedure was followed.