Outlier Detection Using k-Nearest Neighbour Graph

Ville Hautamäki, Ismo Kärkkäinen and Pasi Fränti
University of Joensuu, Department of Computer Science
Joensuu, Finland
{villeh, iak, franti}@cs.joensuu.fi

Abstract

We present an Outlier Detection using Indegree Number (ODIN) algorithm that utilizes a k-nearest neighbour graph. Improvements to an existing kNN distance-based method are also proposed. We compare the methods on real and synthetic data sets. The results show that the proposed method achieves reasonable results with synthetic data and outperforms the compared methods on real data sets with a small number of observations.

1. Introduction

An outlier is defined as an observation that deviates so much from other observations that it arouses suspicion that it was generated by a different mechanism than the other observations [6]. An inlier, on the other hand, is defined as an observation that is explained by the underlying probability density function. In clustering, outliers are considered noise observations that should be removed in order to produce a more reliable clustering [5]. In data mining, detection of anomalous patterns in data is often more interesting than detecting inlier clusters. For example, a breast cancer detection system might consider inlier observations to represent healthy patients and an outlier observation a patient with breast cancer. Similarly, a computer security intrusion detection system treats an inlier pattern as a representation of normal network behaviour and outliers as possible intrusion attempts [13].

The exact definition of an outlier depends on the context. Definitions fall roughly into five categories [7]: i) distribution-based, ii) depth-based, iii) distance-based, iv) clustering-based and v) density-based. Distribution-based methods originate from statistics, where an observation is considered an outlier if it deviates too much from the underlying distribution.
For example, under a normal distribution an outlier is an observation whose distance from the mean exceeds three standard deviations [4]. The problem is that in real-world cases the underlying distribution is usually unknown and cannot be estimated from the data without outliers affecting the estimate, thus creating a chicken-and-egg problem. Distance-based methods [8] define an outlier as an observation that lies at least distance D away from a fraction p of the observations in the data set. The problem is then finding appropriate D and p such that outliers are correctly detected with a small number of false detections. This process usually requires domain knowledge [8]. In clustering-based methods, an outlier is defined as an observation that does not fit the overall clustering pattern [15].

In density-based methods, an outlier is detected from the local density of observations. These methods use different density estimation strategies; a low local density around an observation is an indication of a possible outlier. For example, Brito et al. [1] proposed a Mutual k-Nearest Neighbour (MkNN) graph based approach. The MkNN graph is a graph in which an edge exists between two vectors if they both belong to each other's k-neighbourhood. The MkNN graph is undirected and is a special case of the k-Nearest Neighbour (kNN) graph, in which every node has pointers to its k nearest neighbours. A connected component is considered a cluster if it contains more than one vector, and an outlier if it contains only one vector. Ramaswamy et al. [11] proposed a method in which the vectors with the largest kNN distances are considered outliers. This can be seen as a "sparseness estimate" of a vector, in which the sparsest vectors are considered outliers. We name the method RRS after the original authors' initials.

In this paper, we propose two density-based outlier detection methods.
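The kNN-distance idea of Ramaswamy et al. can be illustrated with a short brute-force sketch (our own Python illustration, not the authors' implementation; the function names are ours): score each vector by the distance to its k-th nearest neighbour and flag the n highest-scoring vectors as outliers.

```python
import numpy as np

def knn_distance_scores(data, k):
    """Distance from each vector to its k-th nearest neighbour.

    Large scores indicate sparse regions; the sparsest vectors are
    candidate outliers (the idea behind RRS).
    """
    # Pairwise Euclidean distances (brute force; fine for small data sets).
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a vector is not its own neighbour
    # Sort each row and take the k-th smallest distance.
    return np.sort(dist, axis=1)[:, k - 1]

def top_n_outliers(data, k, n):
    """Indices of the n vectors with the largest kNN distance."""
    scores = knn_distance_scores(data, k)
    return np.argsort(scores)[::-1][:n]
```

For example, with four vectors in a tight cluster and one far away, the distant vector gets the largest kNN distance and is returned first.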
In the first method, a vector is defined as an outlier if it participates in at most T neighbourhoods in the kNN graph, where the threshold T is a control parameter. To accomplish this, we consider the kNN graph as a directed proximity graph, where the vectors are the vertices of the graph and each edge points from a vector to one of its k nearest neighbours, weighted by the distance between them. We classify a vector as an outlier on the basis of its indegree number in the graph. The second method, a modification of RRS, sorts all vectors by their average kNN distance, for which a global threshold is defined. Vectors with a large average kNN distance are all marked as outliers.
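The indegree idea can be sketched as follows (a minimal brute-force Python version for illustration, not the paper's implementation; the function and variable names are ours): build the directed kNN graph, count how many k-neighbourhoods each vector belongs to, and flag vectors whose indegree is at most the threshold T.

```python
import numpy as np

def odin_outliers(data, k, T):
    """Indegree-based outlier detection on a directed kNN graph.

    Each vector points to its k nearest neighbours; a vector whose
    indegree (number of k-neighbourhoods it participates in) is at
    most T is flagged as an outlier.
    """
    n = len(data)
    # Pairwise Euclidean distances (brute force; fine for small data sets).
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a vector is not its own neighbour
    indegree = np.zeros(n, dtype=int)
    for i in range(n):
        # The k nearest neighbours of vector i each gain one incoming edge.
        for j in np.argsort(dist[i])[:k]:
            indegree[j] += 1
    return [i for i in range(n) if indegree[i] <= T]
```

An isolated vector is never among any other vector's k nearest neighbours, so its indegree stays at zero and it is flagged even with T = 0; increasing T makes the detector more aggressive.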