Fragmentation Design for Efficient Query Execution over Sensitive Distributed Databases

Valentina Ciriani*, Sabrina De Capitani di Vimercati*, Sara Foresti*, Sushil Jajodia, Stefano Paraboschi, and Pierangela Samarati*

* DTI - University of Milan, 26013 Crema - Italy
Email: {ciriani,decapita,foresti,samarati}@dti.unmi.it
CSIS - George Mason University, Fairfax, VA 22030-4444
Email: jajodia@gmu.edu
DIIMM - University of Bergamo, 24044 Dalmine - Italy
Email: parabosc@unibg.it

Abstract

The balance between privacy and utility is a classical problem with an increasing impact on the design of modern information systems. On the one hand, it is crucial to ensure that sensitive information is properly protected; on the other hand, the impact of protection on the workload must be limited, as query efficiency and system performance remain primary requirements. We address this privacy/efficiency balance by proposing an approach that, starting from a flexible definition of confidentiality constraints on a relational schema, applies encryption on information in a parsimonious way and mostly relies on fragmentation to protect sensitive associations among attributes. Fragmentation is guided by workload considerations so as to minimize the cost of executing queries over fragments. We discuss the minimization problem when fragmenting data and provide a heuristic approach to its solution.

1. Introduction

A medical organization manages a collection of data recording the medical histories of a community of patients. Researchers can then access these data and effectively and efficiently discover behavioral and social patterns that exhibit correlation with specific pathologies, with a direct positive impact on medical research. The downside is that a compromise of the server can disclose patients’ information and violate their privacy.

The owner of an e-commerce Web site must store the complete description of the financial data about transactions executed on the site.
The Web site offers a wider choice and lower prices than a brick-and-mortar store, producing an immediate benefit to consumers and a considerable positive economic impact. The downside is that a compromise of the Web server may bring customers’ data into the black market, where they can be used in fraudulent transactions.

The two scenarios demonstrate that, while information and communication technology can provide important benefits, it inevitably introduces risks of exposing private information to improper disclosure. The proposal in this paper aims at reducing the risks introduced by the management of sensitive information.

The crucial observation behind our approach is that users of the system may normally need to access the data in a way that does not introduce risks. For instance, medical researchers may typically need to access generic, non-identifying patient data when performing their research. The owner of the Web site mostly accesses the financial data about the transactions managed by the Web site with no need to reference the personal data of the customer. On the other hand, medical researchers may sometimes need to evaluate parameters that may lead to the specific identity of the patient, and the Web site owner may need to retrieve the complete credit card data when a dispute arises. In addition, regulations are imposing requirements on the management of personal information that often explicitly demand the use of encryption for the protection of sensitive data.

A promising approach to protect sensitive data or sensitive associations among data stored at external parties is represented by the combined use of fragmentation and encryption [4]. Fragmentation and encryption provide protection of data in storage, or when disseminated, ensuring that no sensitive information is disclosed either directly (i.e., present in the database) or indirectly (i.e., derivable from other information in the database).
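As a minimal sketch of the idea above, and not the algorithm proposed in this paper, the following Python fragment checks whether a candidate fragmentation respects a set of confidentiality constraints. It assumes that a constraint is a set of attributes that must not appear together in the clear, that a single-attribute constraint denotes a sensitive attribute that must be encrypted (and is therefore absent from every plaintext fragment), and that fragments must not share plaintext attributes, since a shared attribute could be used to join them. The relation schema, the constraints, and the function name `check_fragmentation` are hypothetical and serve only as illustration.

```python
from itertools import combinations


def check_fragmentation(constraints, fragments):
    """Return a list of problems; an empty list means no violation was found.

    constraints: iterable of sets of attribute names (assumed model).
    fragments:   list of sets holding the attributes stored in plaintext.
    """
    problems = []
    # A constraint is violated if one fragment shows all its attributes in the clear.
    for i, frag in enumerate(fragments):
        for c in constraints:
            if c <= frag:
                problems.append(f"fragment {i} exposes {sorted(c)}")
    # Fragments sharing a plaintext attribute could be linked through it.
    for (i, f), (j, g) in combinations(enumerate(fragments), 2):
        common = f & g
        if common:
            problems.append(f"fragments {i} and {j} share {sorted(common)}")
    return problems


# Hypothetical medical relation: SSN, Name, DoB, ZIP, Illness, Physician.
constraints = [
    frozenset({"SSN"}),                    # sensitive on its own: encrypt
    frozenset({"Name", "Illness"}),        # sensitive association
    frozenset({"DoB", "ZIP", "Illness"}),  # quasi-identifier plus illness
]

# SSN is assumed encrypted, so it appears in no plaintext fragment.
good = [{"Name", "DoB", "ZIP"}, {"Illness", "Physician"}]
bad = [{"Name", "Illness"}, {"DoB", "ZIP", "Illness"}]

print(check_fragmentation(constraints, good))
for problem in check_fragmentation(constraints, bad):
    print(problem)
```

In this sketch the first fragmentation passes because no fragment contains a whole constraint, while the second one both exposes two associations and repeats Illness in two fragments.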
With this design, the data can be outsourced and stored on an untrusted server, typically obtaining lower costs, greater availability, and more efficient distributed access. This scenario resembles the “database-as-a-service” (DAS) paradigm [3], [6], and indeed the techniques presented in the paper can be considered an adaptation of this paradigm to a context where only part of the information stored in the database is confidential and where the confidentiality of associations among values is protected by storing them in separate fragments. The advantage of having only part of the data encrypted is that all the queries that do not need to reconstruct the confidential

© 2009 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.