2016 IEEE International Conference on Big Data (Big Data)
978-1-4673-9005-7/16/$31.00 ©2016 IEEE

Leveraging User Expertise in Collaborative Systems for Annotating Energy Datasets

Hông-Ân Cao, Felix Rauchenstein
Department of Computer Science, ETH Zurich, Switzerland
Email: hong-an.cao@inf.ethz.ch, rafelix@student.ethz.ch

Tri Kurniawan Wijaya, Karl Aberer
Department of Computer Science, EPFL, Switzerland
Email: {tri-kurniawan.wijaya, karl.aberer}@epfl.ch

Nuno Nunes
Madeira Interactive Technologies Institute, Funchal, Portugal
Email: njn@uma.pt

Abstract—While tasks such as segmenting images or determining the sentiment expressed in a sentence can be assigned to regular users, others require background knowledge and thus the selection of expert users. In the case of energy datasets, acquiring data is an obstacle to developing data-driven methods, owing to the prohibitive monetary and time costs of instrumenting households to monitor their energy consumption. Moreover, most datasets contain only raw power time series, even though labels are required to distinguish when a device is in use from when it is idle (incurring stand-by consumption or being off), and, by extension, to separate the human activities that trigger consumption from the baseline consumption. We build upon our Collaborative Annotation Framework for Energy Datasets (CAFED) to evaluate and distinguish the performance of expert users from that of regular users. Through a user study with curated benchmark annotation tasks, we provide data-driven and efficient techniques to detect weak and adversarial workers and to promote users when the contributor base is limited. Additionally, we show that, if carefully selected, the seed gold-standard tasks can be reduced to a small number of tasks that are representative enough to determine a user's expertise and predict crowd-combined annotations with high precision.
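The abstract's idea of grading workers on seed gold-standard tasks can be illustrated with a minimal sketch. This is not the paper's algorithm; the thresholds (0.6 for weak, 0.3 for adversarial) and the function names are hypothetical, chosen only to show how an accuracy score on gold tasks separates reliable annotators from near-random or systematically wrong ones.

```python
# Illustrative sketch (not CAFED's actual method): score a worker on seed
# gold-standard tasks and classify them as reliable, weak, or adversarial.

def worker_accuracy(answers, gold):
    """Fraction of gold tasks the worker answered correctly.

    answers: dict task_id -> worker's label
    gold:    dict task_id -> ground-truth label
    Returns None if the worker answered no gold tasks.
    """
    graded = [(t, a) for t, a in answers.items() if t in gold]
    if not graded:
        return None
    return sum(a == gold[t] for t, a in graded) / len(graded)

def classify_worker(acc, weak=0.6, adversarial=0.3):
    """Hypothetical cutoffs: below-random -> adversarial, near-random -> weak."""
    if acc is None:
        return "unknown"
    if acc < adversarial:
        return "adversarial"  # systematically wrong, e.g. flipping labels
    if acc < weak:
        return "weak"         # barely better than guessing
    return "reliable"

gold = {"t1": "active", "t2": "idle", "t3": "active"}
answers = {"t1": "active", "t2": "idle", "t3": "idle"}
print(classify_worker(worker_accuracy(answers, gold)))  # 2/3 correct -> "reliable"
```

In a real deployment the cutoffs would be calibrated against the observed accuracy distribution of the worker pool rather than fixed a priori.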
Index Terms—Time series analysis; Data mining; Information search and retrieval; Collaboration; Crowdsourcing; Smart energy; Smart meters; Energy data analytics; Datasets; Algorithms

I. INTRODUCTION

The development of learning algorithms relies on data to improve and evaluate the accuracy of their outcomes. Before the spread of online platforms, acquiring ground-truth data was tedious, as it was difficult and costly to recruit workers to perform specific tasks. Such tasks were often solved by benevolent lab mates, and collecting the resulting datasets took considerable time. Nowadays, the majority of the micro-tasks present on Amazon Mechanical Turk or CrowdFlower consist of image and text labeling, and they have contributed to building the large-scale datasets that have enabled progress in computer vision and natural language processing. However, replacing the benevolence of fellow researchers or acquaintances with a monetary incentive can lead to abuse of the system to increase workers' remuneration, at the expense of data quality. While labeling text or image content can be distributed to a large audience of workers owing to the nature of the tasks themselves, and can piggyback on existing systems such as CAPTCHAs, crowdsourcing tasks in other fields, such as labeling genes or locating volcanoes in satellite images, requires domain expertise that the general public does not widely possess. Energy analytics, where data are obtained by instrumenting households to collect power measurements from dwellings, has benefited from the adoption of smart meters, which replaced semesterly or yearly reporting. Smart meters enabled research institutes and organizations to release datasets of household-level aggregated load consumption at finer granularity.
However, developing human-activity-level or, more generally, event-based algorithms linked to the energy consumption caused by household residents requires more labels for training and testing. This is because new datasets have to be collected that include more appliances and real-time annotations from the residents: existing datasets have the shortcomings of having been collected at coarser time granularities or for shorter periods, of including few appliances (sometimes only the aggregated household consumption), or simply of lacking event-based labels (appliances' states or human activities). The high monetary cost of carrying out data collections successfully and reliably has hindered advances in this domain. This cost is mostly tied to the complexity of instrumenting households: the type of electrical appliances and the electrical wiring can force sub-metering to be performed at the circuit level, which requires expensive hardware and the assistance of certified electricians, preventing the use of cheaper alternatives such as smart plugs inserted between the appliance's plug and the electrical outlet. Our Collaborative Annotation Framework for Energy Datasets (CAFED)¹ [1] represented the first effort to retrofit labels onto an existing dataset by leveraging the wisdom of domain experts to annotate an appliance as being active or idle, based on the time series representing its power consumption. The contribution of this paper consists in i) providing a

¹ https://cafed.inf.ethz.ch
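The active/idle distinction described above can be sketched in its simplest possible form: a per-sample threshold on the power trace. This is only a minimal illustration, not CAFED's labeling logic; the 5 W stand-by cutoff is an assumed value, and real appliances need per-device calibration (and typically smoothing over time, which is omitted here).

```python
# Minimal sketch (assumed threshold, not CAFED's annotation method):
# label each sample of an appliance's power trace as "active" or "idle".

def label_power_series(watts, standby_threshold=5.0):
    """Label readings above a stand-by threshold as 'active', else 'idle'.

    The 5 W default is a hypothetical stand-by cutoff; in practice it must
    be calibrated per appliance, since stand-by draw varies widely.
    """
    return ["active" if w > standby_threshold else "idle" for w in watts]

trace = [0.0, 2.1, 3.0, 180.5, 175.0, 4.2]  # watts, e.g. a kettle switching on
print(label_power_series(trace))
# ['idle', 'idle', 'idle', 'active', 'active', 'idle']
```

Expert annotators are valuable precisely because such fixed thresholds fail on appliances with low-power active states or high stand-by draw, which is where human judgment on the time series is needed.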