Big Data Privacy by Design Computation Platform

Extended Abstract

Rui Nuno Lopes Claro
Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
rui.claro@tecnico.ulisboa.pt

ABSTRACT

We live in the age of Big Data. Personal user data, in particular, are necessary for the operation and improvement of everyday Internet services such as Google, Facebook, WhatsApp, and Spotify. Often, the capture and use of personal data are not made explicit to the users, even though they are central to the business model of the companies. However, the right to privacy of each individual has to be respected. How can these two conflicting needs be reconciled, i.e. how can we build useful Big Data systems that are respectful of user privacy? The goal of this work is to design and implement a "proof-of-concept" platform for performing privacy-preserving computations, providing an easy-to-use method to implement privacy-preserving techniques. This system could be used to encapsulate algorithms that can, for example, monitor the vital signs of patients (without exposing the data to other people), or produce real-time recommendations based on location (without disclosing the location to others). This proof-of-concept implemented privacy-preserving versions of Machine Learning algorithms and compared them against a baseline reference, allowing a better understanding of the trade-offs of using privacy-preserving technology.

KEYWORDS

Privacy-preserving Computations; Machine Learning; Data Mining; Big Data; Data Processing; Secure Multi-party Computation

1 INTRODUCTION

With the so-called "Big Data revolution", vast amounts of data are now being analyzed and processed by companies that take advantage of the enormous quantities of information that are generated every day 1. Through this data processing, meaningful information can be obtained to improve existing systems or to discover new approaches in business models.
An example of this lies in the field of healthcare, where it can be beneficial to match patient records from different hospitals in order to identify inefficiencies and develop best practices [8].

1 http://www.vcloudnews.com/every-day-big-data-statistics-2-5-quintillion-bytes-of-data-created-daily/

Most of the time, data contains private information about individuals, such as health records or daily routines. This kind of data cannot be freely processed, because doing so can lead to breaches of private information, such as the AOL Search Leak 2. Due to these breaches, and despite the value that Data Mining (DM) adds to businesses and medical systems, consumers show an increasing concern about the privacy threats posed by it [2]. The privacy of an individual may be violated due to, for example, unauthorized access to personal data, or the use of personal data for purposes other than the one for which the data was collected.

To deal with the privacy issues in DM, a sub-field known as Privacy-preserving Data Mining (PPDM) has been gaining influence over the last years [3]. The objective of PPDM is to guarantee the privacy of sensitive information while, at the same time, preserving the utility of the data for knowledge learning purposes [1]. This can be achieved by using one or more privacy-preserving techniques, such as Garbled Circuits (GC) or Homomorphic Encryption (HE) [3].

Machine Learning (ML) algorithms in the context of Big Data processing are also producing significant results, making it possible to learn from datasets in order to predict future labels (i.e. classes of data) or clusters (groups of related data) for new data. An example of an application of ML algorithms in DM is Classification [7], in which a training set is processed in order to create a classifier for the data, and that classifier is then used to predict class labels for new data. These applications have a greater impact in the field of medicine, as mentioned above.
For example, DeepMind (Google) is building ML algorithms to process admissions in hospitals 3.

By combining ML algorithms and privacy-preserving techniques, it is possible to create DM processes that not only allow knowledge learning on large datasets, but also maintain a level of privacy that is desirable to individuals and that complies with the laws in force [3].

In this work, we present a proof-of-concept platform for privacy-preserving distributed ML computations without resorting to third parties. With it, we aim to give users a ML

2 https://www.networkworld.com/article/2185187/security/15-worst-internet-privacy-scandals-of-all-time.html
3 https://deepmind.com/applied/deepmind-health/