INCREMENTAL METHODS FOR DETECTING OUTLIERS FROM MULTIVARIATE DATA STREAM

Simon Fong a, Zhicong Luo a, Bee Wah Yap b, Suash Deb c
a Department of Computer and Information Science, University of Macau, Macau SAR, China
b Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Selangor, Malaysia
c Department of Computer Science and Engineering, Cambridge Institute of Technology, Ranchi, India
ccfong@umac.mo

ABSTRACT
Outlier detection is one of the most important data mining techniques. It has broad applications such as fraud detection, credit approval, computer network intrusion detection, and anti-money laundering. The basis of outlier detection is to identify data points which are "different" or "far away" from the rest of the data points in a given dataset. Traditional outlier detection methods are based on statistical analysis. However, these methods have an inherent drawback: they require the availability of the entire dataset. In practice, especially in real-time data feed applications, it is unrealistic to wait for all the data because fresh data stream in very quickly. Outlier detection is therefore done in batches, which brings two drawbacks: processing time is relatively long because of the massive batch size, and results may become outdated between successive updates. In this paper, we propose several novel incremental methods to process real-time data effectively for outlier detection. In our experiments, we test three mechanisms for analyzing the dataset, namely Global Analysis, Cumulative Analysis, and Lightweight Analysis with a Sliding Window. The experimental dataset is "household power consumption", a popular benchmark dataset for Massive Online Analysis.

KEY WORDS
Outlier detection; Incremental processing; Data stream mining.

1.
Introduction: Background of Outlier Detection Techniques

Numerous researchers have attempted to apply different techniques to detecting outliers, which are generally defined as follows. "An outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" (Hawkins, 1980). "An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of the dataset" (Barnett & Lewis, 1994). Researchers generally focus on the observation of data irregularities, how each data instance relates to the others (the majority), and how such data instances relate to classification performance. Most of these techniques can be grouped into three categories: distribution-based, distance-based, and density-based methods.

1.1 Distribution-based Outlier Detection Methods

These methods are commonly based on statistical analysis. Detection techniques proposed in the literature range from finding extreme values beyond a certain number of standard deviations to complex normality tests. However, most distribution models apply directly to the feature space and are univariate. Therefore, they are unsuitable even for moderately high-dimensional data sets. Grubbs proposed a test that calculates a Z value as the difference between the mean value of the attribute and the query value, divided by the standard deviation of the attribute, where the mean and the standard deviation are calculated from all attribute values including the query value. The Z value of the query is compared against a critical value at the 1% or 5% significance level. The technique requires no pre-defined parameters, as all parameters are derived directly from the data. However, the success of this approach depends heavily on the number of exemplars in the data set: the higher the number of records, the more statistically representative the sample is likely to be [1].
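As an illustration of this Z-value computation, the following is a minimal sketch. The readings list and the fixed threshold of 2.0 are illustrative assumptions only; a full Grubbs' test would instead compare the Z value against a critical value derived from the t-distribution at the 1% or 5% significance level.

```python
import math

def z_scores(values):
    """Z value for each observation: |x - mean| / std, where the mean and
    standard deviation are computed over all values, including the query
    value itself (as in the description above)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    if std == 0:
        return [0.0] * n  # all values identical: nothing deviates
    return [abs(x - mean) / std for x in values]

def grubbs_like_outliers(values, threshold=2.0):
    """Flag observations whose Z value exceeds a fixed threshold.
    Note: the threshold here is an illustrative stand-in for the
    significance-level critical value used by the actual Grubbs' test."""
    return [x for x, z in zip(values, z_scores(values)) if z > threshold]

readings = [4.1, 4.3, 3.9, 4.0, 4.2, 4.1, 9.8]
print(grubbs_like_outliers(readings))  # the reading 9.8 is flagged
```

Note that with a population standard deviation over n values, the largest attainable Z value is bounded by sqrt(n - 1), so very small samples cannot produce large Z values; this reflects the paper's remark that the approach depends heavily on the number of exemplars.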
In [2], the authors adopted a projection-based outlier detection approach in which low-dimensional projections of the dataset are examined. If a point is sparse in a lower-dimensional projection, the data it represents are deemed abnormal and removed. Brute force, or at best some form of heuristics, is used to determine the projections. A similar method outlined in [3] builds a height-balanced tree containing clustering features on non-leaf nodes and leaf nodes. Leaf nodes with a low density are then considered outliers and filtered out.

1.2 Distance-based or Similarity-based Outlier Detection Methods

Distance-based outlier detection techniques were initially introduced by Knorr and Ng [4]. An object p in a data set DS is a DB(q, dist)-outlier if at least a fraction q of the objects in DS lie at a distance greater than dist from p. This definition is well accepted, since it generalizes

Proceedings of the IASTED International Conference on Artificial Intelligence and Applications (AIA 2014), February 17-19, 2014, Innsbruck, Austria. DOI: 10.2316/P.2014.816-006
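The DB(q, dist) definition above can be checked directly by counting, for each object, how many objects lie farther than dist away. The sketch below is a naive illustration of that definition only (the point set and the values q = 0.7 and dist = 1.0 are illustrative assumptions, and each point's zero distance to itself is included in the count); practical distance-based detectors use index structures or pruning rather than this O(n^2) scan.

```python
import math

def is_db_outlier(p, dataset, q, dist):
    """DB(q, dist)-outlier test: p is an outlier if at least a fraction q
    of the objects in the dataset lie at a distance greater than dist
    from p (Euclidean distance)."""
    far = sum(1 for x in dataset if math.dist(p, x) > dist)
    return far >= q * len(dataset)

points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.1, -0.1), (5.0, 5.0)]
print([pt for pt in points if is_db_outlier(pt, points, q=0.7, dist=1.0)])
# the isolated point (5.0, 5.0) is the only DB(0.7, 1.0)-outlier here
```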