Hayes and Capretz Journal of Big Data (2015) 2:2
DOI 10.1186/s40537-014-0011-y

RESEARCH Open Access

Contextual anomaly detection framework for big sensor data

Michael A Hayes and Miriam AM Capretz*
*Correspondence: mcapretz@uwo.ca
Department of Electrical and Computer Engineering, Western University, London, Canada

Abstract

The ability to detect and process anomalies for Big Data in real time is a difficult task. The volume and velocity of the data within many systems make it difficult for typical algorithms to scale and retain their real-time characteristics. The pervasiveness of data, combined with the fact that many existing algorithms consider only the content of the data source (e.g. a sensor reading itself, without concern for its context), leaves room for potential improvement. The proposed work defines a contextual anomaly detection framework composed of two distinct steps: content detection and context detection. The content detector determines anomalies in real time, while possibly, and likely, identifying false positives. The context detector prunes the output of the content detector, retaining only those anomalies that are anomalous in both content and context. The context detector uses the concept of profiles: groups of similar data points generated by a multivariate clustering algorithm. The research has been evaluated against two real-world sensor datasets provided by a local company in Brampton, Canada. Additionally, the framework has been evaluated against the open-source Dodgers dataset, available at the UCI Machine Learning Repository, and against the R statistical toolbox.

Keywords: Big data analytics; Contextual anomaly detection; Predictive modelling; Multivariate clustering; Streaming sensors

Introduction

Anomalies are abnormal events or patterns that do not conform to expected events or patterns [1].
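The two-step pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration under assumed mechanics, not the authors' implementation: the content detector here is a simple z-score test, the context detector a distance check against a profile centroid, and all function names and thresholds are hypothetical.

```python
import math
from statistics import mean, stdev

# Hypothetical sketch of the two-step framework; the real content and
# context detectors in the paper may differ substantially.

def content_detector(history, reading, z_thresh=3.0):
    """Flag a reading as a candidate (content) anomaly via a z-score test."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return False
    return abs(reading - mu) / sigma > z_thresh

def context_detector(profile_centroid, context_vector, max_dist=2.0):
    """Prune candidates: keep only readings whose context vector lies far
    from the centroid of the profile (cluster) they belong to."""
    return math.dist(context_vector, profile_centroid) > max_dist

def detect(history, reading, context_vector, profile_centroid):
    # A point is reported only if it is anomalous in both content and context.
    return (content_detector(history, reading)
            and context_detector(profile_centroid, context_vector))
```

In this sketch the cheap content test runs on every reading, while the profile comparison runs only on the (few) candidates it emits, which mirrors the real-time motivation stated above.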
Identifying anomalies is important in a broad set of disciplines, including medical diagnosis, insurance and identity fraud, network intrusion, and programming defects. Anomalies are generally categorized into three types: point (or content) anomalies, context anomalies, and collective anomalies. Point anomalies are data points considered abnormal when viewed against the whole dataset. Context anomalies are data points considered abnormal when viewed against meta-information associated with the data points. Finally, collective anomalies are data points considered anomalous only when viewed together with other data points, against the rest of the dataset. Algorithms to detect anomalies generally fall into three types: unsupervised, supervised, and semi-supervised [1]. These techniques range from training the detection algorithm on completely unlabelled data, to using a pre-formed dataset with entries labelled normal or abnormal, to relying only partially on external input. A

© 2015 Hayes and Capretz; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.