Online Unsupervised State Recognition in Sensor Data (Supplementary Materials)

Julien Eberle, Tri Kurniawan Wijaya, and Karl Aberer
School of Computer and Communication Sciences
École Polytechnique Fédérale de Lausanne (EPFL)
CH-1015 Lausanne, Switzerland
Email: {julien.eberle, tri-kurniawan.wijaya, karl.aberer}@epfl.ch

Abstract—Smart sensors, such as smart meters or smart phones, are nowadays ubiquitous. To be “smart”, however, they need to process their input data with limited storage and computational resources. In this paper, we convert the stream of sensor data into a stream of symbols, and further into higher-level symbols, in such a way that common analytical tasks, such as anomaly detection, forecasting, or state recognition, can still be carried out on the transformed data with almost no loss of accuracy and using far fewer resources. We identify states of a monitored system and convert them into symbols (thus reducing data size), while keeping “interesting” events, such as anomalies or transitions between states, as they are. Our algorithm is able to find states of various lengths in an online and unsupervised way, which is crucial since the behavior of the system is not known beforehand. We show the effectiveness of our approach using real-world datasets and various application scenarios. This document contains the supplementary material of our paper presented at PerCom 2015 [1].

I. REVERTING STATES

One of the goals of Spclust and StateFinder is to produce a symbolic time series that supports higher-level applications without being converted back to the sensor’s original measurement values. Since a symbolic time series is much shorter than its original version, this property is desirable, especially given a sensor’s limited storage and computational power. Thus, we do not discuss the process of reverting symbols back to their original values in the main paper.
However, for the sake of completeness, below we illustrate how one can revert a symbolic time series to its original values.

Converting level-0 symbols generated by Spclust to their original values. In this case, we can simply use the cluster centroids to approximate the original values of the symbols. Note that, using Spclust, we have a one-to-one mapping between clusters and symbols.

Converting RLE-compressed triples to non-compressed triples. Each symbol is repeated n times, with n = (t_e - t_s)/r, where t_s and t_e are the start and end times of the triple and r is the sampling rate of the original data. As sensors may have a variable sampling rate or gaps, this transformation could produce more or fewer triples than the original ones.

Converting symbols from level 1 or higher to one level lower. The main idea is to use the Segment Transition Matrix. To find the starting symbol, in case it is not available, we use a heuristic that looks for a symbol with the lowest incoming transition probability and a positive outgoing transition probability. Formally, we take the symbol i such that the sum of the elements in the i-th row, except the diagonal, is greater than 0, and the sum of the elements in the i-th column, except the diagonal, is minimal. Then, following the transition probabilities from one symbol to the next, we build a sequence. Even though we might produce a sequence slightly different from the original one, it eventually has the same symbol distribution.

II. STATE FORECASTING

A. State Forecasting

The algorithm is inspired by the pattern-based forecasting in [2], [3], [4]. It consists of three steps: clustering, pattern similarity search, and prediction. Below we give an overview of the algorithm. Let us assume that we have a symbolic time series S = {s_1, ..., s_t} available as the training set, a forecast horizon h, and a window length w. Our goal is to forecast S* = {s_{t+1}, ..., s_{t+h}}, i.e., sensor values up to h time periods following S.
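For illustration, the level-lowering heuristic of Section I might be sketched as follows. This is a minimal sketch only: the function name revert, the nested-list representation of the Segment Transition Matrix (T[i][j] being the probability that symbol j follows symbol i), and the fixed random seed are our own illustrative assumptions, not part of our implementation.

```python
import random

def revert(T, symbols, n, seed=0):
    """Rebuild a plausible lower-level sequence of length at most n from a
    segment transition matrix T, where T[i][j] is the probability that
    symbol j follows symbol i.  (Illustrative sketch; names are assumed.)
    """
    k = len(symbols)
    # Heuristic starting symbol: positive outgoing transition probability
    # (row sum, diagonal excluded) and minimal incoming transition
    # probability (column sum, diagonal excluded).
    out_sum = [sum(T[i][j] for j in range(k) if j != i) for i in range(k)]
    in_sum = [sum(T[j][i] for j in range(k) if j != i) for i in range(k)]
    cur = min((i for i in range(k) if out_sum[i] > 0), key=lambda i: in_sum[i])
    rng = random.Random(seed)
    seq = [symbols[cur]]
    for _ in range(n - 1):
        if sum(T[cur]) == 0:          # absorbing symbol: nothing can follow
            break
        # Walk the chain according to the transition probabilities.
        cur = rng.choices(range(k), weights=T[cur])[0]
        seq.append(symbols[cur])
    return seq
```

With a deterministic two-symbol matrix, revert([[0, 1], [1, 0]], ['a', 'b'], 4) reproduces the alternating sequence ['a', 'b', 'a', 'b']; with non-degenerate probabilities the reconstructed sequence differs from the original but, as noted above, tends toward the same symbol distribution.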
Clustering. Divide S into a set of symbolic time series 𝒮 = [S_1, ..., S_l], where each S_i ∈ 𝒮 has length h and, if i < j, then S_i is a series preceding S_j. Cluster the symbolic time series in 𝒮, and let c(S_i) be the cluster label of S_i.

Pattern similarity search. Find all matching sequences of length w, {S_i, ..., S_{i+w-1}}, where i ≤ l - w and [c(S_i), ..., c(S_{i+w-1})] = [c(S_{l-w+1}), ..., c(S_l)].

Prediction. Let {S_{i_1+w}, ..., S_{i_k+w}} be the predictor set, i.e., the set of series following the matching sequences. The predicted series, S*, is obtained by aggregating the predictor set.

For the clustering step, we used the KMedoids algorithm. Note that any other unsupervised learning technique can also be used here. As with other unsupervised learning techniques, however, given different parameter settings, we are often uncertain which setting delivers the best cluster configuration. This holds even if the parameters are very intuitive, such as k in KMedoids, the number of clusters to be created. To select the best configuration, we use techniques similar to those in [3], [5]: we perform the clustering step several times using different configurations, i.e., different values of k, and evaluate the resulting configurations using the Silhouette, Dunn, and Davies-Bouldin indices. The best cluster configuration is chosen by majority voting over the three indices.

For this experiment, we use GPS traces from the Nokia Lausanne Data Collection Campaign dataset [6]. The ten users
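A minimal sketch of the pattern similarity search and prediction steps above is given below. It assumes the clustering step has already produced the labels c(S_i); the position-wise majority-vote aggregation is one possible aggregator we chose for illustration, not the only option.

```python
from collections import Counter

def forecast(segments, labels, w):
    """Pattern-based forecast over a symbolic series split into h-length
    segments S_1..S_l (`segments`), with cluster labels c(S_i) (`labels`)
    and pattern window length w.  Returns the aggregated predicted
    segment S*, or None when no matching label sequence exists.
    """
    l = len(segments)
    query = labels[l - w:]                    # [c(S_{l-w+1}), ..., c(S_l)]
    # Pattern similarity search: 0-based indices i whose label window
    # matches the last w labels and which are followed by a known segment.
    matches = [i for i in range(l - w) if labels[i:i + w] == query]
    predictors = [segments[i + w] for i in matches]   # the predictor set
    if not predictors:
        return None
    # Prediction: aggregate the predictor set position-wise, here by
    # majority vote over the symbols at each position.
    h = len(predictors[0])
    return tuple(Counter(seg[j] for seg in predictors).most_common(1)[0][0]
                 for j in range(h))
```

For example, with segments whose labels alternate 0, 1, 0, 1, 0, 1 and w = 2, the query label window is [0, 1], the matching sequences end at the segments labeled 1, and the forecast aggregates the segments that followed them.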