Continuous Outlier Detection in Data Streams: An Extensible Framework and State-Of-The-Art Algorithms

Dimitrios Georgiadis, Aristotle University, Thessaloniki, Greece, dkgeorgi@csd.auth.gr
Maria Kontaki, Aristotle University, Thessaloniki, Greece, kontaki@csd.auth.gr
Anastasios Gounaris, Aristotle University, Thessaloniki, Greece, gounaria@csd.auth.gr
Apostolos Papadopoulos, Aristotle University, Thessaloniki, Greece, papadopo@csd.auth.gr
Kostas Tsichlas, Aristotle University, Thessaloniki, Greece, tsichlas@csd.auth.gr
Yannis Manolopoulos, Aristotle University, Thessaloniki, Greece, manolopo@csd.auth.gr

ABSTRACT
Anomaly detection is an important data mining task that aims to discover elements deviating significantly from the expected behavior; such elements are termed outliers. One of the most widely employed criteria for determining whether an element is an outlier compares the number of neighboring elements within a fixed distance (R) against a fixed threshold (k). Such outliers are referred to as distance-based outliers and are the focus of this work. In this demo, we show both an extensible framework for outlier detection algorithms and specific outlier detection algorithms for the demanding case where outlier detection is performed continuously over a data stream. More specifically: i) we demonstrate a novel flavor of an open-source, publicly available tool for Massive Online Analysis (MOA) that is endowed with capabilities to encapsulate algorithms that continuously detect outliers, and ii) we present four online outlier detection algorithms. Two of these algorithms have been designed by the authors of this demo, with a view to improving on key aspects of outlier mining, such as running time, flexibility and space requirements.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - data mining

General Terms
Algorithms, Performance

Keywords
outlier detection, continuous processing, data streams, metric space

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD’13, June 22–27, 2013, New York, New York, USA.
Copyright 2013 ACM 978-1-4503-2037-5/13/06 ...$15.00.

1. INTRODUCTION
Outlier mining is considered an important task in many applications, such as fraud detection, plagiarism, computer network management, and event detection (e.g., in sensor networks), to name a few. An object is characterized as an outlier if it does not show the expected behavior, which means that it corresponds either to noise or to important knowledge. In both cases, these deviating objects must be detected and reported.

In this work, we focus on distance-based outliers [6], where objects belong to a metric space and, thus, the distance function used satisfies the triangle inequality. According to this definition, an object x is marked as an outlier if there are fewer than k objects located at a distance of at most R from x. Fig. 1 illustrates an example, where object b is an outlier for k=3, since there are fewer than 3 objects in the R-neighborhood of b. The rest of the objects are marked as inliers, because there are at least 3 objects in their R-neighborhood.

We are interested in outlier detection over data streams [1], where new objects continuously arrive whereas old ones expire.
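As a concrete illustration of the distance-based definition above, the following minimal sketch checks each object of a static set by brute force. It is not one of the algorithms presented in this demo. The Euclidean distance, the convention that an object does not count itself as a neighbor, the function names, and the coordinates (chosen only to resemble the layout of Fig. 1) are all assumptions made for illustration.

```python
import math


def euclidean(p, q):
    # Assumed metric; any distance function satisfying the triangle
    # inequality could be passed in instead.
    return math.dist(p, q)


def is_outlier(x_id, objects, R, k, dist=euclidean):
    # Count the objects within distance R of x, excluding x itself
    # (excluding x is an assumption; the text does not state it).
    x = objects[x_id]
    neighbors = sum(
        1
        for other_id, other in objects.items()
        if other_id != x_id and dist(x, other) <= R
    )
    # x is an outlier if it has fewer than k neighbors.
    return neighbors < k


def find_outliers(objects, R, k, dist=euclidean):
    # Brute-force scan: one range count per object, O(n^2) distance calls.
    return sorted(oid for oid in objects if is_outlier(oid, objects, R, k, dist))


# Hypothetical coordinates: a, c, d, e form a tight group, b is isolated.
points = {
    "a": (0.0, 0.0),
    "c": (1.0, 0.0),
    "d": (0.0, 1.0),
    "e": (1.0, 1.0),
    "b": (5.0, 5.0),
}

print(find_outliers(points, R=1.5, k=3))  # prints ['b']
```

With R=1.5 and k=3, each of a, c, d and e has exactly three neighbors and is an inlier, while b has none and is reported as the only outlier.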
We follow the count-based sliding window approach, where each time a new object arrives the oldest one expires, thus keeping the number of active objects constant. The set of active objects is organized by means of a metric-based access method, e.g., an M-tree [5], to facilitate efficient range query processing; range queries are the main means of computing the number of objects in the R-neighborhood of an object.

Figure 1: Distance-based outlier example for k=3 (objects a, b, c, d, e; neighborhoods of radius R).

Mining data streams is more challenging than mining a static set of objects, mainly because of the dynamic nature of the objects. In our case, an object may change its status