Anomaly detection in streaming environmental sensor data: A data-driven modeling approach

David J. Hill a,*, Barbara S. Minsker b

a Department of Civil and Environmental Engineering, Rutgers University, 623 Bowser Rd, Piscataway, NJ 08854, USA
b Department of Civil and Environmental Engineering, University of Illinois Urbana-Champaign, 205 N. Mathews Ave., Urbana, IL 61801, USA

Article history: Received 9 March 2009. Received in revised form 25 August 2009. Accepted 25 August 2009. Available online 24 October 2009.

Keywords: Coastal environment; Data-driven modeling; Anomaly detection; Machine learning; Real-time data; Sensor networks; Data quality control; Artificial intelligence

Abstract

The deployment of environmental sensors has generated an interest in real-time applications of the data they collect. This research develops a real-time anomaly detection method for environmental data streams that can be used to identify data that deviate from historical patterns. The method is based on an autoregressive data-driven model of the data stream and its corresponding prediction interval. It performs fast, incremental evaluation of data as it becomes available, scales to large quantities of data, and requires no pre-classification of anomalies. Furthermore, this method can be easily deployed on a large heterogeneous sensor network. Sixteen instantiations of this method are compared based on their ability to identify measurement errors in a windspeed data stream from Corpus Christi, Texas. The results indicate that a multilayer perceptron model of the data stream, coupled with replacement of anomalous data points, performs well at identifying erroneous data in this data stream.

© 2009 Published by Elsevier Ltd.

1. Introduction

In-situ environmental sensors are sensors that are physically located in the environment they are monitoring. Through telemetry, the time-series data collected by these sensors can be transmitted continuously to a repository as a data stream.
Recently, there have been efforts to make use of streaming data for real-time applications (e.g., Bonner et al., 2002). For example, draft plans for the Water and Environmental Research Systems (WATERS) Network, a proposed national environmental observatory network, have identified real-time analysis and modeling as a significant priority (NRC, 2006). Because in-situ sensors operate under harsh conditions, and because the data they collect must be transmitted across communication networks, the data can easily become corrupted. Undetected errors can significantly affect the data's value for real-time applications. Thus, the National Science Foundation (NSF, 2005) has indicated a need for automated data quality assurance and control (QA/QC). Anomaly detection is the process of identifying data that deviate markedly from historical patterns (Hodge and Austin, 2004). Anomalous data can be caused by sensor or data transmission errors or by infrequent system behaviors that are often of interest to scientific and regulatory communities. In addition to data QA/QC, where data anomalies may be the result of sensor or telemetry errors, anomaly detection has many other practical applications. These include adaptive monitoring, where anomalous data indicate phenomena that researchers may wish to investigate further through increased sampling, and anomalous event detection, where anomalous data signal system behaviors that require other actions to be taken, for example in the case of a natural disaster. These applications require that data anomalies be identified in near-real time; thus, the anomaly detection method must be rapid and must operate incrementally to ensure that detection keeps up with the rate of data collection.
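As a minimal sketch of what such incremental detection can look like, the Python fragment below processes one measurement at a time, compares it with a one-step-ahead prediction made from recent values, and flags it when it falls outside an interval around that prediction. It is not the method evaluated in this paper: the moving-average predictor stands in for a data-driven model, the interval is simply a multiple of the standard deviation of past residuals, and every name and parameter value (`StreamingAnomalyDetector`, `window`, `z`, `min_residuals`, `replace`) is an assumption made for illustration.

```python
from collections import deque
from statistics import mean, stdev


class StreamingAnomalyDetector:
    """Flag points that fall outside an interval around a one-step prediction.

    Illustrative only: the predictor and interval are simple stand-ins.
    """

    def __init__(self, window=5, z=3.0, min_residuals=10, replace=True):
        self.window = window                # past values fed to the predictor
        self.z = z                          # interval half-width, in residual std devs
        self.min_residuals = min_residuals  # residuals needed before flagging starts
        self.replace = replace              # substitute the prediction for flagged points
        self.history = deque(maxlen=window)
        self.residuals = deque(maxlen=200)

    def predict(self):
        # Placeholder model (assumption): mean of the last `window` values.
        return mean(self.history)

    def update(self, x):
        """Process one measurement; return True if it is classified anomalous."""
        if len(self.history) < self.window:
            self.history.append(x)          # warm-up: too little context to predict
            return False
        pred = self.predict()
        anomalous = False
        if len(self.residuals) >= self.min_residuals:
            half_width = self.z * stdev(self.residuals)
            anomalous = abs(x - pred) > half_width
        if anomalous:
            # Optionally keep the flagged value out of future model inputs.
            self.history.append(pred if self.replace else x)
        else:
            self.residuals.append(x - pred)
            self.history.append(x)
        return anomalous


if __name__ == "__main__":
    # Synthetic stream (made-up values) with a single injected spike.
    stream = [10.0 + 0.1 * ((i % 7) - 3) for i in range(60)]
    stream[40] = 50.0
    detector = StreamingAnomalyDetector()
    flagged = [i for i, x in enumerate(stream) if detector.update(x)]
    print("flagged indices:", flagged)
```

Each call to `update` costs a fixed amount of work that does not grow with the length of the stream, which is what allows a detector of this kind to keep pace with data collection. With `replace=False` the flagged value is fed back into the predictor, so in this sketch a single spike also distorts the next few predictions.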
* Corresponding author. Tel.: +1 217 714 3490. E-mail addresses: ecodavid@rci.rutgers.edu (D.J. Hill), minsker@illinois.edu (B.S. Minsker).
Environmental Modelling & Software 25 (2010) 1014–1022. doi:10.1016/j.envsoft.2009.08.010. Journal homepage: www.elsevier.com/locate/envsoft

Traditionally, anomaly detection has been carried out manually with the assistance of data visualization tools (Mourad and Bertrand-Krajewski, 2002), but manual methods are unsuitable for real-time detection in streaming data, since they would require an operator to perform analysis 24 h a day, 7 days a week. More recently, researchers have suggested automated statistical and machine learning approaches, such as minimum volume ellipsoid (Rousseeuw and Leroy, 1996), convex peeling (Rousseeuw and Leroy, 1996), nearest neighbor (Tang et al., 2002; Ramaswamy et al., 2000), clustering (Bolton and Hand, 2001), neural network classifier