Data Analysis Tools for Sensor-Based Science

Stuart Ozer, Jim Gray
Microsoft Research, San Francisco, CA
{struarto,Jim.Gray}@microsoft.com

Alex Szalay, Andreas Terzis, Razvan Musaloiu-E.
Johns Hopkins University, Baltimore, MD
{szalay,terzis,razvanm}@jhu.edu

Katalin Szlavecz, Randal Burns, Josh Cogan
Johns Hopkins University, Baltimore, MD
{szlavecz,randal,joshc}@jhu.edu

ABSTRACT
Science is increasingly driven by data collected automatically from arrays of inexpensive sensors. The collected data volumes require a different approach from scientists' current Excel spreadsheet storage and analysis model. Spreadsheets work well for small data sets, but scientists want high-level summaries of their data for various statistical analyses without sacrificing the ability to drill down to every bit of the raw data. This demonstration describes our prototype data analysis system, which is suitable for browsing and visualization, like a spreadsheet, but scalable to much larger data sets.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information filtering.

General Terms
Management, Measurement, Experimentation, Human Factors.

Keywords: Sensor Networks, Data Cubes.

1. INTRODUCTION
While the proposed approach is applicable to any Wireless Sensor Network (WSN) that generates large amounts of data, collected by many sensors over long periods of time, we ground our design in an environmental monitoring application that we developed and deployed during the fall of 2005 [3].

The purpose of our WSN is soil monitoring: motes periodically collect soil measurements, including soil temperature and soil humidity, as well as ambient temperature and light. Measurements are stored locally in the motes' flash memory until they are retrieved by a network gateway using a reliable transfer protocol. All collected measurements are subsequently inserted into a relational database.
This database stores not only "raw" measurements but also calibrated versions, calculated using stored procedures, and it drives user interfaces including HTML and Web Services front ends [2]. To allow arbitrary analyses, the Web interface lets users send SQL queries directly to the database. This "guru" interface has already been very useful, but domain scientists prefer to interact with visual, high-level summarization tools rather than having to use a different set of tools to analyze data extracted from the database. The data analysis tools presented in the following section provide this service.

2. DATA CUBES FOR DATA ANALYSIS
The calibrated and interpolated data available in the relational database can answer a variety of scientific questions exploring both the time and spatial dimensions of small soil ecosystems. However, a high-level view of the measured quantities is as important to ecologists as examining individual measurements and looking for unusual cases. They want to analyze aggregations and functions of the sensor data, visualize trends, and cross-correlate them with other biological measurements at many different scales.

These requirements for slicing, aggregation, and analysis can be summarized by general ad hoc query requests such as:
(1) Display functions of the measurements (e.g., average, min, max, standard deviation) for a particular time or time interval, for one sensor, for a patch, for all sensors at a site, or for all sites.
(2) Show the results as a function of depth, time, and category (land cover, age of vegetation, crop management type, upslope, downslope, etc.).

Figure 1. Sensor data cube dimensional model. [Diagram: a Sensor Dimension, a Measurement Type Dimension, and a Time Dimension. Level labels shown include all, site, patch, node, sensor, type, depth, make/model, measurement type, tenMinute, hour, day, week, year, hour of day, day of year, and week of year. Measures: sum, count, min, max, median, std deviation.]
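
Query request (1) above amounts to a roll-up: readings are grouped at a chosen level of the sensor hierarchy and a chosen time granularity, and summary measures are computed per group. The following is a minimal Python sketch of that idea, not the system's implementation. The sample readings, the names READINGS, sensor_key, time_key, and roll_up, the exact ordering of the sensor hierarchy, and the use of the population standard deviation are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

# Hypothetical calibrated readings (illustrative values only):
# (site, patch, node, sensor, measurement type, timestamp, value)
READINGS = [
    ("siteA", "patch1", "node1", "s1", "soil_temp", datetime(2005, 10, 1, 0, 0), 10.0),
    ("siteA", "patch1", "node1", "s1", "soil_temp", datetime(2005, 10, 1, 0, 10), 12.0),
    ("siteA", "patch1", "node2", "s2", "soil_temp", datetime(2005, 10, 1, 1, 0), 14.0),
    ("siteA", "patch2", "node3", "s3", "soil_temp", datetime(2005, 10, 2, 0, 0), 16.0),
    ("siteB", "patch3", "node4", "s4", "soil_temp", datetime(2005, 10, 2, 0, 0), 20.0),
]

# Assumed sensor hierarchy, coarsest to finest.
SENSOR_LEVELS = ("all", "site", "patch", "node", "sensor")


def sensor_key(row, level):
    """Keep only the hierarchy fields down to the requested level."""
    depth = SENSOR_LEVELS.index(level)  # "all" -> 0 fields, "sensor" -> 4 fields
    return row[:depth]


def time_key(ts, level):
    """Truncate a timestamp to the requested granularity."""
    if level == "all":
        return ()
    if level == "day":
        return (ts.year, ts.month, ts.day)
    if level == "hour":
        return (ts.year, ts.month, ts.day, ts.hour)
    raise ValueError(f"unsupported time level: {level}")


def roll_up(rows, sensor_level, time_level, mtype, start=None, end=None):
    """Aggregate one measurement type over the half-open interval [start, end)."""
    groups = defaultdict(list)
    for row in rows:
        ts, value = row[5], row[6]
        if row[4] != mtype:
            continue
        if start is not None and ts < start:
            continue
        if end is not None and ts >= end:
            continue
        key = (sensor_key(row, sensor_level), time_key(ts, time_level))
        groups[key].append(value)
    return {
        key: {
            "count": len(vals),
            "avg": mean(vals),
            "min": min(vals),
            "max": max(vals),
            "std": pstdev(vals),  # population standard deviation (an assumption)
        }
        for key, vals in groups.items()
    }
```

For example, roll_up(READINGS, "site", "all", "soil_temp") summarizes each site over the whole period, while roll_up(READINGS, "patch", "day", "soil_temp") drills down to one summary per patch per day. A data cube precomputes such aggregates so that moving between levels does not require rescanning the raw data; this sketch recomputes them on every call.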
Copyright is held by the author/owner(s). SenSys'06, November 1–3, 2006, Boulder, Colorado, USA. ACM 1-59593-343-3/06/0011.