Proposal of a new stability concept to detect changes in unsupervised data streams

Rosane M.M. Vallim ⇑, Rodrigo F. de Mello

ICMC, Universidade de São Paulo, Av. Trabalhador São Carlense 400, São Carlos, SP 13566-590, Brazil

Article info

Keywords: Data streams; Unsupervised change detection; Surrogate stability; Surrogate data

Abstract

Learning from continuous streams of data has received increasing attention in recent years. Among the many challenges related to mining data streams, change detection is a frequently addressed topic. Determining whether data characteristics change over time is a major concern for data stream algorithms, in both the supervised and the unsupervised scenario. The unsupervised scenario is particularly relevant because many practical applications do not provide target labeling information. In this scenario, most strategies induce consecutive models over time and compare them in order to detect data changes, assuming that model changes are a consequence of data modifications. However, there is no guarantee that this assumption holds, since those algorithms do not rely on any theoretical background to ensure that model divergences truly indicate data changes. The need for such a theoretical framework has motivated this paper to propose a new stability concept that establishes bounds on the learning abilities of unsupervised algorithms designed to detect changes in data streams. This stability concept, based on the surrogate data strategy from time series analysis, provides learning guarantees for online unsupervised algorithms even when there is time dependency among observations. Furthermore, we propose a new change detection algorithm that meets the requirements of this stability concept. Experimental results on different synthetic scenarios illustrate how the stability concept proposed in this paper is applied to detect changes in unsupervised data streams.

© 2014 Elsevier Ltd.
All rights reserved.

1. Introduction

Data Stream Mining is an active area of research concerned with the development of algorithms capable of learning models from data streams. Data streams are ordered, infinite sequences of data that become available over time (Gama & Rodrigues, 2007). Because of this infinite nature, researchers usually assume that the probability distribution generating a stream is neither fixed nor stationary; consequently, data characteristics evolve over time. This evolving aspect has motivated several studies to design algorithms that detect when data is actually changing, allowing for efficient and effective model reinduction in the presence of new data behavior.

In the supervised learning scenario, the most common strategies for detecting data changes monitor performance measures of the induced model, such as accuracy or precision (Gama, Medas, Castillo, & Rodrigues, 2004). If these measures fall below an established threshold, the current model is considered outdated and, therefore, no longer useful for making predictions about data. A change in the data distribution is then signaled, and the model is reinduced using new data. This strategy gives a fair indication of changes; however, it requires labeled examples. Unfortunately, most real-world applications only provide unlabeled data, meaning there is no a priori knowledge to consider when inducing models.

In this context, many researchers have been designing clustering techniques to approach the unsupervised scenario, as well as measures to monitor clustering evolution over time in an attempt to detect changes in data behavior (Albertini & Mello, 2010; Marsland, Shapiro, & Nehmzow, 2002; Vallim, Filho, de Mello, & de Carvalho, 2013). These strategies assume that changes observed in the induced models indicate changes in data characteristics.
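As a rough illustration of the model-comparison strategy just described, the following Python sketch induces a deliberately simple model (the mean of each non-overlapping window) and flags a change whenever two consecutive models differ by more than a threshold. The function names, the window size, and the threshold are hypothetical choices made for illustration only; this is neither the algorithm proposed in this paper nor any of the cited clustering methods.

```python
def window_models(stream, window_size):
    """Induce one toy 'model' (the window mean) per non-overlapping window.

    Assumption: a window mean stands in for a clustering model here.
    """
    models = []
    for start in range(0, len(stream) - window_size + 1, window_size):
        window = stream[start:start + window_size]
        models.append(sum(window) / len(window))
    return models


def detect_changes(stream, window_size, threshold):
    """Return indices of windows whose model diverges from the previous one.

    `threshold` is a hypothetical user-chosen parameter: a window is
    flagged when the absolute difference between consecutive window
    means exceeds it.
    """
    models = window_models(stream, window_size)
    return [
        i for i in range(1, len(models))
        if abs(models[i] - models[i - 1]) > threshold
    ]


# Synthetic stream: 20 observations around 0, then 20 observations around 5.
stream = [0.0, 0.1, -0.1, 0.0] * 5 + [5.0, 5.1, 4.9, 5.0] * 5

changed_windows = detect_changes(stream, window_size=4, threshold=1.0)
print(changed_windows)
```

Whether a window is flagged depends entirely on the chosen window size and threshold, so a flagged window reflects a divergence between models, not a verified change in the data itself.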
However, those strategies offer no formal guarantee, because algorithm parameters can lead to model adaptations that may not correspond to data modifications.

⇑ Corresponding author. Tel.: +55 16 9607 9831. E-mail addresses: rosane.maffei@gmail.com (R.M.M. Vallim), mello@icmc.usp.br (R.F. de Mello).
Expert Systems with Applications 41 (2014) 7350–7360. ISSN 0957-4174. http://dx.doi.org/10.1016/j.eswa.2014.06.031

The lack of learning guarantees for unsupervised scenarios motivated Carlsson and Memoli (2010) to propose a stability concept for unsupervised batch learning, which is formalized in terms of model divergences when input data is subject to order perturbations. According to this concept, an algorithm is proven to be stable