2016 IEEE International Conference on Big Data (Big Data)
978-1-4673-9005-7/16/$31.00 ©2016 IEEE
Leveraging User Expertise in Collaborative Systems
for Annotating Energy Datasets
Hông-Ân Cao, Felix Rauchenstein
Department of Computer Science
ETH Zurich, Switzerland
Email: hong-an.cao@inf.ethz.ch, rafelix@student.ethz.ch

Tri Kurniawan Wijaya, Karl Aberer
Department of Computer Science
EPFL, Switzerland
Email: {tri-kurniawan.wijaya, karl.aberer}@epfl.ch

Nuno Nunes
Madeira Interactive Technologies Institute
Funchal, Portugal
Email: njn@uma.pt
Abstract—While tasks such as segmenting images or determining the sentiment expressed in a sentence can be assigned to regular users, others require background knowledge and thus the selection of expert users. In the case of energy datasets, acquiring data is an obstacle to developing data-driven methods, owing to the prohibitive monetary and time costs of instrumenting households to monitor their energy consumption. Moreover, most datasets contain only raw power time series, even though labels are required to distinguish when a device is in use from when it is idle (incurring stand-by consumption or being off), and by extension to separate the human activities triggering consumption from the baseline consumption. We build upon our Collaborative Annotation Framework for Energy Datasets (CAFED) to evaluate and distinguish the performance of expert users from that of regular users. Through a user study with curated benchmark annotation tasks, we provide efficient, data-driven techniques to detect weak and adversarial workers and to promote users when the contributor base is limited. Additionally, we show that, if carefully selected, the seed gold-standard tasks can be reduced to a small number of tasks that are representative enough to determine a user's expertise and to predict crowd-combined annotations with high precision.
Index Terms—Time series analysis; Data mining; Information search and retrieval; Collaboration; Crowdsourcing; Smart energy; Smart meters; Energy data analytics; Datasets; Algorithms
I. INTRODUCTION
The development of learning algorithms calls for data to train them and to evaluate the accuracy of their outcome. Before the spread of online crowdsourcing platforms, acquiring ground-truth data was tedious, as recruiting workers to perform specific tasks was difficult and costly. Such tasks were often solved by benevolent lab mates, and collecting the resulting datasets took considerable time. Nowadays, the majority of the micro-tasks on Amazon Mechanical Turk or CrowdFlower consist of image and text labeling; they have contributed to building large-scale datasets that have enabled progress in computer vision and natural language processing. However, replacing the benevolence of fellow researchers or acquaintances with a monetary incentive can lead workers to abuse the system to increase their remuneration, at the expense of data quality.
While labeling text or image content can be distributed to a large audience of workers owing to the nature of the tasks, and can piggyback on existing systems such as CAPTCHAs, crowdsourcing tasks in other fields, such as labeling genes or locating volcanoes in satellite images, requires domain expertise that is not widely available to the general public. Energy analytics, where data are obtained by instrumenting households to record the power consumption of dwellings, has benefited from the adoption of smart meters, which replace semesterly or yearly manual readings. Smart meters have enabled the release of datasets collected by various research institutes and organizations, containing household-level aggregated load consumption at finer granularity. However, developing human-activity-level or, more generally, event-based algorithms linked to the energy consumption caused by household residents requires more labels for training and testing. New datasets thus have to be collected that include more appliances and real-time annotations from the residents: existing datasets have been collected at coarser time granularities, over shorter periods, with few appliances (sometimes only the aggregated household consumption), or simply without event-based labels (appliance states or human activities). The high monetary cost of carrying out data collections reliably has hindered advances in this domain. This cost is mostly related to the complexity of instrumenting households: the type of electrical appliances and the electrical wiring can force the sub-metering to be performed at the circuit level, requiring expensive hardware and the assistance of certified electricians, and preventing the use of cheaper alternatives such as smart plugs that can be inserted between the appliance's plug and the electrical outlet. Our Collaborative Annotation Framework for Energy Datasets (CAFED)¹ [1] represented the first effort to retrofit labels onto an existing dataset by leveraging the wisdom of domain experts to annotate an appliance as being active or idle, based on the time series representing its power consumption.
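To illustrate the kind of labeling decision annotators make (this is not CAFED's method, which relies on human experts), a naive sketch might flag a reading as active whenever its power draw exceeds a fixed stand-by threshold; the function name and the 5 W threshold below are assumptions chosen for illustration:

```python
def label_states(power_watts, threshold=5.0):
    """Label each power reading as 'active' or 'idle'.

    Hypothetical heuristic: readings above `threshold` watts are
    treated as active use; readings at or below it as idle
    (stand-by consumption or off). Real appliances often defeat
    such fixed thresholds, which is why expert annotation is needed.
    """
    return ["active" if p > threshold else "idle" for p in power_watts]

# Example: a kettle drawing ~1.8 kW while boiling, near zero otherwise.
readings = [0.0, 0.4, 1800.0, 1795.5, 0.3]
print(label_states(readings))
# ['idle', 'idle', 'active', 'active', 'idle']
```

In practice, multi-state appliances (e.g., a washing machine cycling between heating and spinning) make any single threshold unreliable, which motivates collecting human annotations instead.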
The contribution of this paper consists in i) providing a
¹ https://cafed.inf.ethz.ch