JPrivacy: A Java Privacy Profiling Framework for Big Data Applications

Mohamed Abdellatif, Department of Computer Science, University of Miami, Coral Gables, USA, m.abdellatif@miami.edu
Iman Saleh, Department of Computer Science, University of Miami, Coral Gables, USA, iman@miami.edu
M. Brian Blake, Department of Computer Science, University of Miami, Coral Gables, USA, m.brian.blake@miami.edu

Abstract: Businesses and government agencies are continuously generating and collecting huge amounts of data and building related Big Data applications. Big Data applications involve the collaborative integration of APIs from different providers. A challenge in this domain is to guarantee that the integration conforms to privacy terms and regulations. In this paper, we present JPrivacy, a privacy profiling framework for Big Data applications. JPrivacy proposes a model for privacy rules and provides the algorithms and related tools to check Java code against these rules. We show through experimentation that JPrivacy can effectively detect privacy violations by statically analyzing a piece of code.

Keywords: Privacy, Big Data, Java, Static Analysis

I. INTRODUCTION

Given the inexpensive nature and availability of information storage media, individuals worldwide have exponentially increased their production and persistence of large amounts of data, whether such data are captured as text, images, or sound. Analysis of these Big Data repositories introduces fascinating new opportunities for discovering insights that contribute to different branches of science. The potential of Big Data comes, however, with a price: users’ privacy is often at risk. Guarantees of conformance to privacy terms and regulations are limited in current Big Data analytics and mining practices. Unlike relational databases, which exhibit a clear structure, Big Data is characterized by its unstructured nature and the variety of its data types, including both textual and audio-visual material.
Only Big Data applications encapsulate the logic that makes sense of such unstructured repositories. Hence, our work provides tools and frameworks for building trusted Big Data applications. Using our framework, Big Data developers are able to verify that their code complies with privacy agreements and that sensitive user information is kept private regardless of changes in the applications and/or privacy regulations.

Our work investigates the following research questions:
- RQ1: How can privacy be formally specified? Can we devise machine-readable privacy rules?
- RQ2: How can privacy rules be extracted from natural-language descriptions, and how can regulations such as HIPAA be formalized?
- RQ3: How can we leverage the formal definition of privacy to reason about privacy conformance for a piece of code?
- RQ4: How can tests be generated automatically from a formal specification of privacy?

In this paper, we address RQ1 and RQ3. We present JPrivacy, a privacy profiling system for Java code. JPrivacy is based on a formal model for privacy rules and provides the algorithms and related tools to check Java code against these rules. Figure 1 shows the JPrivacy framework and its main components. JPrivacy takes as input a Java application and a natural-language description of privacy terms. It formalizes the privacy terms and checks the application’s code for potential violations of these terms. JPrivacy can also leverage these terms to generate test cases. These test cases guarantee that an application continues to comply with the privacy regulations as the code and the underlying Big Data repositories evolve.

Figure 1. JPrivacy Framework

II. RELATED WORK

Research has recognized the importance of building security and privacy measures into software systems. Most work focuses, however, on the extraction of requirements from security-related policies and regulations [1][2][3]. Other work extracts formal privacy requirements from legal regulations such as HIPAA [4][5].
The notion of privacy is defined in terms of access rights to sensitive data. Once the requirements are defined, a traditional software engineering process can take place. As we consider complex, collaboratively built Big Data applications, we build privacy considerations into the software engineering process.

III. PRIVACY MODEL

Our proposed privacy model categorizes data from a privacy standpoint. We identify five categories of data: Critical Sensitivity (CS), High Sensitivity (HS), Moderate Sensitivity (MS), Low Sensitivity (LS), and Non-Sensitive (NS). As the names suggest, these categories associate a level of sensitivity with a piece of data. Next, we define the operations on data that are relevant to a privacy checker. We identify three categories of operations that can be performed on a data field: reading, writing, and sharing. By writing a data field, we mean persisting it in a file or a database. A data field can be written in encrypted or plaintext format. Sharing a data field, on the other hand, means sending it as a parameter to third-party software. This third party can be an API, a Web Service, or simply a code module, and it can be trusted or untrusted.

[Figure 1 diagram labels: Privacy Terms; Big Data Application; Privacy Formalization; Machine-Readable Privacy Rules; Code Parser; Syntax Tree; JPrivacy Static Code Analysis; Automatic Test Case Generation; Tests; Privacy Violations (WARNING, ERROR); RQ1 to RQ4]

COLLABORATECOM 2014, October 22-25, Miami, United States. Copyright © 2014 ICST. DOI 10.4108/icst.collaboratecom.2014.257666
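
As a minimal sketch of the privacy model in Section III, the sensitivity categories and the data operations could be encoded as Java enums. The class name `PrivacyModelSketch`, the `Verdict` type, and in particular the policy inside `check` are assumptions made for illustration only; the text does not state which combinations of category and operation JPrivacy reports, nor does it describe JPrivacy's actual API.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative sketch only: all names and the policy below are assumptions,
// not JPrivacy's actual API or rule set.
class PrivacyModelSketch {

    // The sensitivity categories named in Section III (CS, HS, MS, LS, NS).
    enum Sensitivity { CRITICAL, HIGH, MODERATE, LOW, NON_SENSITIVE }

    // The operations named in Section III: reading, writing (encrypted or
    // plaintext), and sharing (with a trusted or untrusted third party).
    enum Operation { READ, WRITE_ENCRYPTED, WRITE_PLAINTEXT, SHARE_TRUSTED, SHARE_UNTRUSTED }

    // Outcome levels, mirroring the WARNING and ERROR labels in Figure 1.
    enum Verdict { OK, WARNING, ERROR }

    // Hypothetical policy: only these operations are treated as exposing data.
    private static final Set<Operation> EXPOSING =
            EnumSet.of(Operation.WRITE_PLAINTEXT, Operation.SHARE_UNTRUSTED);

    // Hypothetical rule: exposing critical or high-sensitivity data is an
    // error, exposing moderate-sensitivity data is a warning, the rest is OK.
    static Verdict check(Sensitivity level, Operation op) {
        if (!EXPOSING.contains(op)) {
            return Verdict.OK;
        }
        switch (level) {
            case CRITICAL:
            case HIGH:
                return Verdict.ERROR;
            case MODERATE:
                return Verdict.WARNING;
            default:
                return Verdict.OK;
        }
    }

    public static void main(String[] args) {
        System.out.println(check(Sensitivity.CRITICAL, Operation.WRITE_PLAINTEXT)); // ERROR
        System.out.println(check(Sensitivity.MODERATE, Operation.SHARE_UNTRUSTED)); // WARNING
        System.out.println(check(Sensitivity.HIGH, Operation.WRITE_ENCRYPTED));     // OK
    }
}
```

Keeping the categories and operations as enums makes a rule a plain function of a (category, operation) pair, which a static checker could evaluate at each read, write, or share site it finds in the syntax tree. A real rule set would be derived from the formalized privacy terms rather than hard-coded as it is here.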