WIP: Towards Optimal Online Approximation of Data Streams

Phillip Sitbon, Nirupama Bulusu, Wu-chi Feng
Portland State University
E-mail: {sitbon,nbulusu,wuchi}@cs.pdx.edu

Abstract—In this paper, we provide a basic solution for online compression of data streams using error-bounded piecewise-linear approximation (PLA). We compare this method to the optimal (but offline) solution. Our current work in progress is developing an online PLA method that meets the same optimality constraints as the offline method. In addition, the vertices of the constructed approximations are a subset of the sampled data points, which we believe to be a benefit in many scenarios.

I. INTRODUCTION

In many applications, data is presented as a continuous stream in which the most recent data is expected to be available with low latency from the source. For example, stock market data is most useful in a one-minute window, and traffic data more than an hour old often cannot represent current conditions accurately. In simplistic terms, network applications can take several online approaches toward handling data streams, some of which may resemble the following:

Pass every value without manipulation: generates large network traffic.
Average data over a fixed period: reduces network traffic, but also reduces data accuracy by an amount that depends on the averaging period.
Maintain a sliding window: only transmit data when significant changes occur, thus balancing network traffic and data accuracy.

Resources are often constrained in terms of power (wireless sensor networks), cost (3G cellular networks), or capability (acoustic networks). Additionally, many-to-one networking applications can suffer from DDoS (Distributed Denial-of-Service) effects as scale increases, thus requiring undesired reductions in data quality in order to maintain scalability. For these reasons, there is significant motivation to reduce data volume while maintaining robustness.
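As an illustration of the third approach listed above (transmit only when significant changes occur), the following is a minimal sketch of a greedy, error-bounded sliding window whose output vertices are always original samples. It is not the authors' implementation: the names `Sample`, `within_bound`, `greedy_pla`, and `epsilon` are hypothetical, and the sketch assumes time-stamped scalar samples with strictly increasing timestamps and an error measured as vertical distance from the connecting segment.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical sample layout: (timestamp, value), timestamps strictly increasing.
Sample = Tuple[float, float]


def within_bound(window: List[Sample], epsilon: float) -> bool:
    """Return True if the straight segment from window[0] to window[-1]
    stays within epsilon (vertical distance) of every sample in between."""
    (t0, v0), (t1, v1) = window[0], window[-1]
    for t, v in window[1:-1]:
        predicted = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        if abs(v - predicted) > epsilon:
            return False
    return True


def greedy_pla(stream: Iterable[Sample], epsilon: float) -> Iterator[Sample]:
    """Yield the subset of samples to transmit as segment vertices."""
    window: List[Sample] = []
    for sample in stream:
        if not window:
            window.append(sample)
            yield sample  # the first sample is always a vertex
            continue
        window.append(sample)
        if len(window) > 2 and not within_bound(window, epsilon):
            vertex = window[-2]  # last sample for which the segment still fit
            yield vertex
            window = [vertex, sample]  # start the next segment at that vertex
    if len(window) > 1:
        yield window[-1]  # close the final segment at the last sample seen


if __name__ == "__main__":
    # A flat run followed by a ramp: only the corner and the endpoints are kept.
    points = [(0, 0.0), (1, 0.0), (2, 0.0), (3, 1.0), (4, 2.0), (5, 3.0)]
    print(list(greedy_pla(points, epsilon=0.1)))
```

Because each vertex is emitted as soon as the bound would be violated, the sketch runs online. Rechecking the whole window on every new sample keeps the code short at the cost of quadratic work in the worst case, and a greedy choice of this kind is not guaranteed to produce the smallest possible subset.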
Although many solutions to online stream compression exist, they provide an optimally small number of data points at the cost of discarding the original data; conversely, optimal approximations using connected segments of actual data points are calculated offline. Our work in progress is developing a piecewise linear method to approximate data streams with a minimal subset while also providing an accuracy guarantee. Providing a minimal subset requires choosing data points only from the generated sets without introducing averaged or otherwise interpolated points. This allows for the up-front transmission and processing of approximated data; additionally, untransmitted data can be retained for later transmission when network traffic is lower or when power reserves are restored. Because all data values are original sensor readings, any additional data will “fill in the gaps,” thus providing additional resolution for statistical analysis and reporting.

In this paper, we define the optimal (minimal subset) measure of a data stream, which is calculated in an offline manner using a dynamic programming algorithm. We then devise an online method of piecewise linear approximation. It is a greedy approach in which only changes beyond a given error bound are recorded. To evaluate this method, we use uniform two-dimensional position data; however, these methods can also be applied to time-series sensor readings.

Our preliminary results indicate that the greedy approximation method achieves impressive compression ratios for mobility data, while still providing the benefits of online data streaming. Our eventual goal with this work is to provide a data stream that achieves the optimal compression for any given error bound and, unlike current solutions, is also online. Where possible, we will present preliminary mathematical results from this effort.

II.
RELATED WORK

Significant effort has been dedicated to approximating sequential data within a guaranteed bound, both for time-series data [1]–[3] and for higher-dimensional data [4]. Some approximation methods are related to networking only indirectly, such as fuzzy [5], [6] and aggregation [5], [7], [8] methods, but almost all perform some form of linear fit. Common terminology for sequential data approximation includes filters, such as swing filters or slide filters [2]. Elmeleegy et al. define slide filters as disjoint piecewise approximations that are sequentially adjusted in order to minimize residual error, and this method is most similar to our greedy approximation method. Kiely et al. propose an “Adaptive Linear Filtering Compression” algorithm as a lossless compression algorithm for sensor networks, in which the filter aspect is used to predict sample values, which are corrected in later transmissions if wrong [9]. In our work, we chose to keep all data segments connected without reverse correction or data prediction. Because of this, we are able to choose only actual data points from data sources without interpolating new points to facilitate an increase in approximation accuracy.

Keogh et al. define a sliding window method in their discussion and treat it in its more common sense, although it is not similar to the slide filters described above [1]. It is, however, very similar to the greedy algorithm used here, and the authors also mention that it is widely used due to its online nature, for example in frequent-patterns discovery [10]. Keogh et al.’s