Hayes and Capretz Journal of Big Data (2015) 2:2
DOI 10.1186/s40537-014-0011-y
RESEARCH Open Access
Contextual anomaly detection framework for
big sensor data
Michael A Hayes and Miriam AM Capretz*
*Correspondence:
mcapretz@uwo.ca
Department of Electrical and
Computer Engineering, Western
University, London, Canada
Abstract
The ability to detect and process anomalies in Big Data in real time is a difficult task.
The volume and velocity of the data within many systems make it difficult for typical
algorithms to scale while retaining their real-time characteristics. The pervasiveness of
data, combined with the fact that many existing algorithms consider only the content of
the data source (e.g. a sensor reading itself) without regard for its context, leaves room
for improvement. The proposed work defines a contextual anomaly detection
framework. It is composed of two distinct steps: content detection and context
detection. The content detector determines anomalies in real time, although it is likely
to identify false positives as well. The context detector prunes the output of the content
detector, retaining only those anomalies that are anomalous in both content and
context. The context detector utilizes the concept of profiles: groups of similar data
points generated by a multivariate clustering algorithm. The research has been
evaluated against two real-world sensor
datasets provided by a local company in Brampton, Canada. Additionally, the
framework has been evaluated against the open-source Dodgers dataset, available at
the UCI machine learning repository, and against the R statistical toolbox.
Keywords: Big data analytics; Contextual anomaly detection; Predictive modelling;
Multivariate clustering; Streaming sensors
Introduction
Anomalies are abnormal events or patterns that do not conform to expected events or
patterns [1]. Identifying anomalies is important in a broad set of disciplines, including
medical diagnosis, insurance and identity fraud, network intrusion, and programming
defects. Anomalies are generally categorized into three types: point (or content)
anomalies, context anomalies, and collective anomalies. Point anomalies occur for data
points that are considered abnormal when viewed against the whole dataset. Context
anomalies are data points that are considered abnormal when viewed against
meta-information associated with the data points. Finally, collective anomalies are data
points that are considered anomalous only when viewed together, as a group, against
the rest of the dataset.
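To make the point/context distinction concrete, the sketch below flags point (content) anomalies with a simple global z-score test, then checks each candidate against a per-context profile, mirroring the content/context split described above. All function names, thresholds, profile formats, and data here are illustrative assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch: point (content) vs. contextual anomaly detection.
# Thresholds, profile structure, and data are assumptions for demonstration.

def content_detector(readings, k=0.75):
    """Flag indices whose readings deviate from the global mean by more
    than k standard deviations (a simple point-anomaly test)."""
    n = len(readings)
    mean = sum(readings) / n
    std = (sum((x - mean) ** 2 for x in readings) / n) ** 0.5
    return [i for i, x in enumerate(readings) if abs(x - mean) > k * std]

def context_detector(candidates, readings, contexts, profiles, k=2.0):
    """Keep only candidates that are also abnormal within their own
    context's (mean, std) profile; the rest are pruned as false positives."""
    kept = []
    for i in candidates:
        mean, std = profiles[contexts[i]]
        if abs(readings[i] - mean) > k * std:
            kept.append(i)
    return kept

# Daytime temperatures cluster near 21, night-time near 2; index 2 (a
# daytime reading of 40) is the only genuine contextual anomaly.
readings = [20, 22, 40, 2, 1, 3]
contexts = ["day", "day", "day", "night", "night", "night"]
profiles = {"day": (21.0, 1.0), "night": (2.0, 1.0)}

# The night readings look abnormal against the global mean, so the content
# step flags them too; the context step prunes those false positives.
candidates = content_detector(readings)
anomalies = context_detector(candidates, readings, contexts, profiles)
```

Note how the low night-time readings are point-anomalous against the dataset as a whole yet perfectly normal within their "night" context, which is exactly the pruning role the context detector plays in the proposed framework.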
Algorithms to detect anomalies generally fall into three types: unsupervised, super-
vised, and semi-supervised [1]. These techniques range from training the detection
algorithm on completely unlabelled data, to using a pre-formed dataset with entries
labelled normal or abnormal, to those that rely only partially on external input. A
© 2015 Hayes and Capretz; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction
in any medium, provided the original work is properly credited.