IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, REVISED OCTOBER 2014

Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection

Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović

Abstract—Outlier detection in high-dimensional data presents various challenges resulting from the "curse of dimensionality." A prevailing view is that distance concentration, i.e., the tendency of distances in high-dimensional data to become indiscernible, hinders the detection of outliers by making distance-based methods label all points as almost equally good outliers. In this paper we provide evidence supporting the opinion that such a view is too simple, by demonstrating that distance-based methods can produce more contrasting outlier scores in high-dimensional settings. Furthermore, we show that high dimensionality can have a different impact, by reexamining the notion of reverse nearest neighbors in the unsupervised outlier-detection context. Namely, it was recently observed that the distribution of points' reverse-neighbor counts becomes skewed in high dimensions, resulting in the phenomenon known as hubness. We provide insight into how some points (antihubs) appear very infrequently in k-NN lists of other points, and explain the connection between antihubs, outliers, and existing unsupervised outlier-detection methods. By evaluating the classic k-NN method, the angle-based technique (ABOD) designed for high-dimensional data, the density-based local outlier factor (LOF) and influenced outlierness (INFLO) methods, and antihub-based methods on various synthetic and real-world data sets, we offer novel insight into the usefulness of reverse neighbor counts in unsupervised outlier detection.

Index Terms—Outlier detection, reverse nearest neighbors, high-dimensional data, distance concentration

1 INTRODUCTION

Outlier (anomaly) detection refers to the task of identifying patterns that do not conform to established regular behavior [1].
Despite the lack of a rigid mathematical definition of outliers, their detection is a widely applied practice [2]. The interest in outliers is strong since they may constitute critical and actionable information in various domains, such as intrusion and fraud detection, and medical diagnosis.

• M. Radovanović and M. Ivanović are with the Faculty of Sciences, University of Novi Sad, Serbia. E-mail: {radacha, mira}@dmi.uns.ac.rs
• A. Nanopoulos is with the Ingolstadt School of Management, University of Eichstaett-Ingolstadt, Germany. E-mail: alexandros.nanopoulos@ku.de

The task of detecting outliers can be categorized as supervised, semi-supervised, or unsupervised, depending on the existence of labels for outliers and/or regular instances. Among these categories, unsupervised methods are the most widely applied [1], because the other categories require accurate and representative labels that are often prohibitively expensive to obtain. Unsupervised methods include distance-based methods [3], [4], [5], which mainly rely on a measure of distance or similarity in order to detect outliers.

A commonly accepted opinion is that, due to the "curse of dimensionality," distance becomes meaningless [6], since distance measures concentrate, i.e., pairwise distances become indiscernible as dimensionality increases [7], [8]. The implied effect of distance concentration on unsupervised outlier detection is that every point in high-dimensional space becomes an almost equally good outlier [9]. This somewhat simplified view was recently challenged [10].

Our motivation is based on the following factors:

(1) It is crucial to understand how the increase of dimensionality impacts outlier detection. As explained in [10], the actual challenges posed by the "curse of dimensionality" differ from the commonly accepted view that every point becomes an almost equally good outlier in high-dimensional space [9].
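The concentration effect invoked above can be observed directly: as dimensionality grows, the relative contrast between a query point's farthest and nearest neighbor shrinks. The following is a minimal sketch on uniformly random data, assuming Euclidean distance; the function name and parameters are ours, chosen for illustration only.

```python
import math
import random

def relative_contrast(n_points, dim, seed=0):
    """Relative contrast (Dmax - Dmin) / Dmin of Euclidean distances
    from one random query point to n_points uniformly random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, p))))
    return (max(dists) - min(dists)) / min(dists)

if __name__ == "__main__":
    # Contrast shrinks by orders of magnitude as dimensionality grows.
    for d in (2, 10, 100, 1000):
        print(d, round(relative_contrast(500, d), 3))
```

On uniform data the contrast drops from large values in two dimensions to well below 1 in a thousand dimensions, which is the concentration behavior reported in [7], [8]; the point made in this paper is that this alone does not imply all points become equally good outliers.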
We will present further evidence which challenges this view, motivating the (re)examination of existing methods.

(2) Reverse nearest-neighbor counts have been proposed in the past as a means of expressing the outlierness of data points [11], [12],¹ but no insight apart from basic intuition was offered as to why these counts should represent meaningful outlier scores. Recent observations that reverse-neighbor counts are affected by increased dimensionality of data [14] warrant their reexamination for the outlier-detection task. In this light, we will revisit the ODIN method [11].

Our contributions can be summarized as follows:

(1) In Section 3 we discuss the challenges that unsupervised outlier detection faces in high-dimensional space. Despite the general impression that all points in a high-dimensional data set seem to become outliers [9], we show that unsupervised methods can detect outliers which are more pronounced in high dimensions, under the assumption that all (or most) data attributes are meaningful, i.e., not noisy. Our findings complement the observations from [10] by

1. To prevent confusion, it needs to be noted that the paper [12] incorrectly cites earlier work [13] by the second author of the present article as the source of the reverse-neighbor method.
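The reverse-neighbor-count idea referred to in point (2) can be sketched as follows: a point that appears in few other points' k-NN lists (an antihub) is scored as an outlier. This is a toy ODIN-style illustration of ours, assuming Euclidean distance, not the authors' implementation.

```python
import math
from collections import Counter

def knn_indices(points, k):
    """For each point, the indices of its k nearest neighbors (Euclidean)."""
    nn = []
    for i, p in enumerate(points):
        d = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        nn.append([j for _, j in d[:k]])
    return nn

def reverse_neighbor_counts(points, k):
    """N_k(x): the number of k-NN lists of other points in which each
    point appears. Low counts (antihubs) indicate outliers, as in ODIN."""
    counts = Counter()
    for neighbors in knn_indices(points, k):
        counts.update(neighbors)
    return [counts[i] for i in range(len(points))]

# Toy data: a tight cluster plus one isolated point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = reverse_neighbor_counts(data, k=2)
```

In this toy data set the isolated point at (5, 5) appears in no other point's 2-NN list, so its reverse-neighbor count is 0, the lowest score; the cluster points all have nonzero counts.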