arXiv:1904.00156v1 [cs.CY] 30 Mar 2019

Viewpoint: Personal Data and the Internet of Things

It is time to care about digital provenance.

Thomas Pasquier, University of Bristol
David Eyers, University of Otago
Jean Bacon, University of Cambridge

ABSTRACT

The Internet of Things promises a connected environment reacting to and addressing our every need, but based on the assumption that all of our movements and words can be recorded and analysed to achieve this end. Ubiquitous surveillance is also a precondition for most dystopian societies, both real and fictional. How our personal data is processed and consumed in an ever more connected world must imperatively be made transparent, and more effective technical solutions than those currently on offer to manage personal data must urgently be investigated.

The need for greater transparency

We have all read market predictions describing billions of devices and the hundreds of billions of dollars in profit that the Internet of Things (IoT) promises.¹ Security and the challenges it represents [27] are often highlighted as major issues for IoT, alongside scalability and standardisation. In 2017, FBI Director James Comey warned, during a Senate hearing, of the threat represented by a botnet taking control of devices owned by unsuspecting users. Such a botnet can seize control of devices ranging from connected dishwashers² to smart home cameras and connected toys, not only using them as a platform to launch cyber attacks, but also potentially harvesting the data such devices collect.

In addition to concerns about cybersecurity, corporate usage of personal data has seen increased public scrutiny. A recent focus of concern has been connected home hubs (e.g., Amazon Alexa, Google Home).³ Articles on the topic discussed whether conversations were being constantly recorded and if so, where those records went.
Similarly, the University of Rennes faced a public backlash after revealing its plan to deploy smart-beds in its accommodation to detect “abnormal” usage patterns.⁴ A clear question emerges from IoT-related fears: “how and why is my data being used?”

As concerns grow, legislators across the world are taking action in order to protect the public. For example, the recent EU General Data Protection Regulation (GDPR), which took effect in May 2018,⁵ and the forthcoming ePrivacy Regulation⁶ place strong responsibility on data controllers to protect personal data, and to notify users of security breaches. The EU Commission defines a Data Controller as the party that determines the purposes for which, and the means by which, personal data is processed (why and how the data is processed). EU regulations further impose constraints on where EU citizens' data can be processed and what type of data (i.e., “special category” data falls under more stringent constraints). The data controller must provide means for end users to determine whether their data is properly handled and means to effect their rights. Overall, there must be mechanisms to determine what data is processed, how, why and where.

¹ https://goo.gl/udt9vh
² https://nvd.nist.gov/vuln/detail/CVE-2017-7240
³ https://www.wired.com/2016/12/alexa-and-google-record-your-voice/
⁴ https://goo.gl/pzC1Kz
⁵ http://www.privacy-regulation.eu/en/index.htm
⁶ https://ec.europa.eu/digital-single-market/en/proposal-eprivacy-regulation

Such concerns have drawn researchers to look at means to develop more accountable and transparent systems [10, 24]. The problem has also been clearly highlighted by the EU Data Protection Working Party: “As a result of the need to provide pervasive services in an unobtrusive manner, users might in practice find themselves under third-party monitoring.
This may result in situations where the user can lose all control on the dissemination of his/her data, depending on whether or not the collection and processing of this data will be made in a transparent manner or not.”

Indeed, modern computing systems contain many components that operate as black boxes; they accept inputs and generate outputs but do not disclose their internal working. Beyond privacy concerns, this also limits the ability to detect cyber-attacks, or more generally to understand cyber-behaviour. Because of these concerns DARPA, in the US, launched the Transparent Computing project⁷ to explore means to build more transparent systems through the use of digital provenance with the particular aim of identifying advanced persistent threats. While DARPA’s work is a good start, we believe that there is an urgent need to reach much further. In the rest of the article, we explore how provenance can be an answer to some IoT concerns and the challenges faced to deploy provenance techniques.

Digital Provenance

There is a growing clamour for more transparency, but straightforward, widespread technical solutions have yet to emerge. Typical software log records often prove insufficient to audit complex distributed systems as they fail to capture the complex causality relationships between events. Digital provenance [8] is an alternative means to record system events. Digital provenance is the record of information flow within a computer system in order to assess the origin of data (e.g., its quality or its validity).

The concept first emerged in the database research community as a means to explain the response to a given query [16]. Provenance research later expanded to address issues of scientific reproducibility, notably by providing mechanisms to reconstitute computational environments from formal records of scientific computations [23].
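
The idea of recording information flow in order to assess the origin of data can be illustrated with a minimal sketch. The code below is a toy example, not the authors' system: the class `ProvenanceLog`, its methods, and the item names (`microphone`, `ad_profile`, and so on) are all hypothetical. It assumes each recorded event states that one item was derived from another through some activity, and it answers the question "which inputs contributed to this output?" by walking those records backwards.

```python
from collections import defaultdict


class ProvenanceLog:
    """Toy record of information flows (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps each item to the set of (source item, activity) pairs
        # it was directly derived from.
        self._sources = defaultdict(set)

    def record_flow(self, src, dst, via):
        """Record that information flowed from `src` to `dst` through `via`."""
        self._sources[dst].add((src, via))

    def origins(self, item):
        """Return every item that `item` transitively depends on."""
        seen = set()
        stack = [item]
        while stack:
            current = stack.pop()
            for src, _via in self._sources.get(current, ()):
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen


# Hypothetical smart-home scenario: which raw inputs fed the advertising profile?
log = ProvenanceLog()
log.record_flow("microphone", "audio_clip", via="record")
log.record_flow("audio_clip", "transcript", via="speech_to_text")
log.record_flow("transcript", "ad_profile", via="profile_builder")
log.record_flow("purchase_history", "ad_profile", via="profile_builder")

print(sorted(log.origins("ad_profile")))
# ['audio_clip', 'microphone', 'purchase_history', 'transcript']
```

A flat log would list the same four events but leave the reader to reconstruct the chain from microphone to profile; keeping the dependencies explicit is what makes the question directly answerable.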
More recently, provenance has been explored within the cybersecurity community [25] as a means to explain intrusions [18] and, lately, to detect them [14].

⁷ https://www.darpa.mil/program/transparent-computing

Provenance records are represented as a directed acyclic graph that shows causality relationships between the states of the objects