SURF and SURF-PI: A File Format and API for Non-Intrusive Load Monitoring Public Datasets

Lucas Pereira, Nuno Nunes
Madeira-ITI, University of Madeira
Polo Científico e Tecnológico da Madeira, floor -2
Caminho da Penteada, Funchal, Madeira, Portugal
lucas.pereira@m-iti.org, njn@uma.pt

Mario Bergés
Civil and Environmental Engineering, Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA, USA
marioberges@cmu.edu

ABSTRACT
In this paper we propose a common file format and API for public Non-Intrusive Load Monitoring (NILM) datasets such that researchers can easily evaluate their approaches across the different datasets and benchmark their results against prior work. The proposed file format enables storing the power demand of the whole house along with individual appliance consumption, and other relevant metadata in a single compact file, whereas the API supports the creation and manipulation of individual files and datasets in the proposed format.

Categories and Subject Descriptors
D.2.13 [Software Engineering]: Reusable Software – reusable libraries.

Keywords
Energy Disaggregation, Datasets, File Format, API.

1. INTRODUCTION
NILM, first introduced by George Hart in his seminal work [1], is the process of estimating the energy consumption of individual appliances given only current and voltage measurements taken at a limited number of locations in the electric distribution of a building. Yet, despite decades of research and recent efforts towards creating public datasets (e.g. [2] and [3]) to validate and improve the existing approaches, very few formal evaluations (e.g. [4]) of the technology have been carried out so far, thus raising questions about the large scale applicability of this technology. We argue that one of the reasons for this is the difficulty of objectively comparing the performance of different algorithms given the lack of public datasets and the wide differences between the ones currently available.
In fact, only recently has there been a serious effort to homogenize the existing datasets and provide a single interface to run evaluations [5]. We wish to contribute to this effort by proposing SURF and SURF-PI, a common file format and programming interface that support the creation and manipulation of public NILM datasets and help homogenize the process of systematically evaluating NILM algorithms across different datasets.

2. SURF FILE FORMAT
The proposed format is an extension of the Waveform Audio File Format (WAVE), which supports the storage of digital audio data and metadata annotations according to the underlying chunk structure defined by the Resource Interchange File Format (RIFF) standard. There are four main reasons for extending this format rather than another: i) data and metadata are all stored in a single compact file, thus limiting the number of artifacts that need to be managed; ii) custom chunks can be added without breaking the file consistency; iii) the resulting files are optimized to have little overhead; and iv) mature programming interfaces exist for a variety of programming languages, which facilitates the expansion and portability of the proposed format and API.

The SURF file format is currently composed of 13 chunks, each containing its own header and data bytes. Eight chunks are inherited from the WAVE format, one from the RIFF standard, and the remaining four are custom chunks created to supplement the files with relevant metadata. Next we describe the underlying structure of the SURF format.

2.1 Power Demand Data
The characteristics of the power demand data (sampling rate, sample size and number of channels) are defined in the Format chunk, and the data itself is stored in the Data chunk. The data values are stored uncompressed (to preserve the original signal) in little-endian byte order and scaled to the open interval ]-1,1[.
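The storage scheme described in Section 2.1 can be illustrated with a minimal sketch that relies only on Python's standard `wave` and `struct` modules. It is not the SURF-PI API, and it only produces the two standard WAVE chunks (Format and Data), not the full set of SURF chunks. The function names, the 16-bit sample size, the 50 Hz sampling rate and the full-scale values (20 A and 400 V) are all assumptions made for illustration.

```python
import io
import math
import struct
import wave
from array import array


def scale_to_unit(samples, full_scale):
    """Scale raw readings into ]-1, 1[; full_scale must exceed every |sample|."""
    if any(abs(s) >= full_scale for s in samples):
        raise ValueError("full_scale must be larger than every absolute sample")
    return [s / full_scale for s in samples]


def write_power_wav(target, channels, sample_rate):
    """Store scaled channels as interleaved, uncompressed 16-bit PCM."""
    # array("h") holds native-order signed 16-bit integers; the wave module
    # takes care of emitting them in little-endian byte order.
    pcm = array("h")
    for frame in zip(*channels):
        pcm.extend(round(value * 32767) for value in frame)
    with wave.open(target, "wb") as wav:
        wav.setnchannels(len(channels))   # e.g. current and voltage
        wav.setsampwidth(2)               # sample size in bytes (assumed)
        wav.setframerate(sample_rate)     # sampling rate in Hz
        wav.writeframes(pcm.tobytes())


def list_chunks(riff_bytes):
    """Return (chunk_id, size) pairs for the top-level chunks of a WAVE file."""
    riff_id, _, form = struct.unpack_from("<4sI4s", riff_bytes, 0)
    if riff_id != b"RIFF" or form != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    chunks, pos = [], 12
    while pos + 8 <= len(riff_bytes):
        chunk_id, size = struct.unpack_from("<4sI", riff_bytes, pos)
        chunks.append((chunk_id.decode("ascii"), size))
        pos += 8 + size + (size & 1)      # RIFF chunks are padded to even length
    return chunks


# Hypothetical one-second recording of current (A) and voltage (V).
SAMPLE_RATE = 50
t = [i / SAMPLE_RATE for i in range(SAMPLE_RATE)]
current = [10.0 * math.sin(2 * math.pi * x) for x in t]
voltage = [325.0 * math.sin(2 * math.pi * x) for x in t]

buffer = io.BytesIO()
write_power_wav(
    buffer,
    [scale_to_unit(current, 20.0), scale_to_unit(voltage, 400.0)],
    SAMPLE_RATE,
)
riff_bytes = buffer.getvalue()
chunks = list_chunks(riff_bytes)
print(chunks)
```

Recovering the physical values from such a file requires the full-scale factors used for scaling, so they would have to be kept alongside the data. This sketch does not store them.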
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
e-Energy’14, June 11--13, 2014, Cambridge, UK.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2819-7/14/06…$15.00.
http://dx.doi.org/10.1145/2602044.2602078

2.2 Individual Appliance and User Activities
Individual appliance activities correspond to the changes in power demand that are triggered when individual loads change their operating mode (e.g. going from on to off and vice versa), whereas user activities are groups of related individual appliance activities (e.g. combining the clothes-washer, dryer and iron activities to form the “laundry” user activity). Every activity has a timestamp (user activities also have an end timestamp) that is mapped to the corresponding sample number in the power demand data.

These activities are embedded in the SURF files using the Cue, Associated Data List, Label and Labeled Text chunks. Each activity is represented by its respective position(s) in the power demand data and a JSON-formatted string with its details (see figure 1). For example, the following two JSON strings