The Sparse Regression Cube: A Reliable Modeling Technique for Open Cyber-physical Systems

Hossein Ahmadi, Tarek Abdelzaher, Jiawei Han, Nam Pham
Department of Computer Science, University of Illinois at Urbana-Champaign
{hahmadi2, zaher, hanj, nampham2}@illinois.edu

Raghu K. Ganti
IBM T. J. Watson Research Center
rganti@us.ibm.com

Abstract: Understanding the end-to-end behavior of complex systems where computing technology interacts with physical world properties is a core challenge in cyber-physical computing. This paper develops a hierarchical modeling methodology for open cyber-physical systems that combines techniques in estimation theory with those in data mining to reliably capture complex system behavior at different levels of abstraction. Our technique is also novel in the sense that it provides a measure of confidence in predictions. An application to green transportation is discussed, where the goal is to reduce vehicular fuel consumption and carbon footprint. First-principle models of cyber-physical systems can be very complex and include a large number of parameters, whereas empirical regression models are often unreliable when a high number of parameters is involved. Our new modeling technique, called the Sparse Regression Cube, simultaneously (i) partitions sparse, high-dimensional measurements into subspaces within which reliable linear regression models apply and (ii) determines the best reliable model for each partition, quantifying uncertainty in output prediction. Evaluation results show that the framework significantly improves modeling accuracy compared to previous approaches and correctly quantifies prediction error, while maintaining high efficiency and scalability.

I. INTRODUCTION

A fundamental challenge in cyber-physical computing is to accurately capture the end-to-end behavior of large systems in which software interacts with a complex physical environment.
This paper presents a new hierarchical modeling technique that results from an interdisciplinary effort to combine the best of estimation-theory and data mining techniques to enable modeling such systems reliably at multiple degrees of abstraction. A reliable model is one that remains sufficiently accurate over the whole input range. We use green transportation as a running example, where software routing optimizations use physical models of cars, streets, and traffic conditions to enable energy savings. In this case the "system" is the collection of cars traveling on different roads under different traffic conditions. We show that, using our new modeling technique, we are able to significantly improve the accuracy of fuel consumption predictions, while quantifying prediction accuracy, and hence the quality of green routes.

Research was sponsored in part by Natural Sciences and Engineering Research Council of Canada (NSERC) and by the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

We are especially interested in modeling open systems where some components, interactions, processes, or constraints are not well understood or not measured. For example, the fuel consumption of a vehicle depends not only on fixed factors such as weight, frontal area, and engine type, but also on variables such as vehicle speed, acceleration, congestion patterns, and idle time, which are hard to predict accurately in advance. A single MPG rating (miles per gallon for highway and city driving) is quite inadequate.
For instance, it cannot help decide which of two alternative city routes will consume less fuel. Building first-principle models from scratch is not always practical, as too many parameters are involved. Using regression to estimate model coefficients is also challenging, because reliable estimation suffers from the curse of dimensionality: the state space grows exponentially in the number of parameters, so any fixed set of samples covers that space more and more sparsely. As the number of parameters increases, estimated models become less reliable.

This paper proposes the Sparse Regression Cube modeling technique. It jointly (i) partitions sparse, high-dimensional data into subspaces within which reliable linear regression models apply and (ii) determines the best such model for each partition using standard regression tools. Importantly, sparse regression cubes uncover the inherent generalization hierarchy across such subspaces. For instance, in the example of predicting the fuel efficiency of cars on different roads (as a function of car and road parameters), sparse regression cubes indicate how best to categorize cars for the purpose of building fuel prediction models in each category. Categorization could be by car class, make, model, manufacturer, year, or other attributes. These categories have a hierarchical structure. For example, one may build prediction models for cars by make, model, and year (e.g., Ford Taurus 2005, Toyota Celica 2000). One may also aggregate these over years or over car models to generate prediction models for larger categories (e.g., all Ford Taurus cars, or all Toyotas of 2000). Such generalizations help when there is not enough data on each type of car to build a reliable model for that type alone. They are also useful for predicting the performance of a car from the performance of others in the same generalized category. Hence, finding accurate generalizations is an interesting problem in cyber-physical systems where sampling is sparse and the number of parameters is large.
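
To make the idea of generalization across categories concrete, the following Python sketch fits one least-squares model per category at every level of a make/model/year hierarchy and, at prediction time, falls back to the most specific level that has a model. It is only an illustration under stated assumptions, not the Sparse Regression Cube algorithm itself: the function names (`fit_linear`, `build_models`, `predict`), the fixed `min_samples` threshold used as a crude stand-in for reliability, the single speed-like input feature, and all data values are hypothetical.

```python
from collections import defaultdict

import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    X = np.asarray(X, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return coef


def build_models(samples, min_samples):
    """Fit one model per category prefix that has at least min_samples rows.

    Each sample is (key, features, output), where key is a tuple such as
    (make, model, year). The prefixes (), (make,), (make, model), ... form
    the generalization hierarchy; the threshold is an assumed reliability proxy.
    """
    groups = defaultdict(list)
    for key, x, y in samples:
        for depth in range(len(key) + 1):
            groups[key[:depth]].append((x, y))
    models = {}
    for prefix, rows in groups.items():
        if len(rows) >= min_samples:
            models[prefix] = fit_linear([r[0] for r in rows], [r[1] for r in rows])
    return models


def predict(models, key, x):
    """Predict with the most specific category that has a fitted model."""
    for depth in range(len(key), -1, -1):
        coef = models.get(key[:depth])
        if coef is not None:
            return key[:depth], float(coef[0] + np.dot(coef[1:], x))
    raise KeyError("no model available at any level of the hierarchy")


# Hypothetical data: output = 2 * feature + 1 for every Ford Taurus sample.
samples = [
    (("Ford", "Taurus", 2004), [10.0], 21.0),
    (("Ford", "Taurus", 2004), [20.0], 41.0),
    (("Ford", "Taurus", 2004), [30.0], 61.0),
    (("Ford", "Taurus", 2005), [40.0], 81.0),
]
models = build_models(samples, min_samples=3)
level, fuel = predict(models, ("Ford", "Taurus", 2005), [25.0])
print(level, round(fuel, 3))
```

In this made-up data set the 2005 model year has a single sample, so no model is fit for it and the prediction comes from the aggregated Ford Taurus category. A sample-count threshold is the simplest possible reliability test; it says nothing about prediction confidence, which the technique described in this paper quantifies explicitly.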