Fusion: Privacy-preserving Distributed Protocol for High-Dimensional Data Mashup

Gaby G. Dagher, CSE, Concordia University, Montréal, Québec, Canada, daghir@encs.concordia.ca
Farkhund Iqbal, CTI, Zayed University, Abu Dhabi, United Arab Emirates, Farkhund.Iqbal@zu.ac.ae
Mahtab Arafati, CIISE, Concordia University, Montréal, Québec, Canada, ma arafa@ciise.concordia.ca
Benjamin C. M. Fung, SIS, McGill University, Montréal, Québec, Canada, ben.fung@mcgill.ca

Abstract—In the last decade, several approaches concerning private data release for data mining have been proposed. Data mashup, on the other hand, has recently emerged as a mechanism for integrating data from several data providers. Fusing both techniques to generate mashup data in a distributed environment while providing privacy and utility guarantees on the output involves several challenges: how to ensure that no unnecessary information is leaked to the other parties during the mashup process, how to ensure that the mashup data is protected against certain privacy threats, and how to handle the high-dimensional nature of the mashup data while guaranteeing high data utility. In this paper, we present Fusion, a privacy-preserving multi-party protocol for data mashup with guaranteed LKC-privacy for the purpose of data mining. Experiments on real-life data demonstrate that the anonymous mashup data provide better data utility, that the approach can handle high-dimensional data, and that it is scalable with respect to the data size.

Keywords—mashup; privacy; anonymization; data mining

I. INTRODUCTION

As the amount of data available from a wide range of domains has increased tremendously in recent years, the demand for data sharing and integration has also risen. The mashup of related data from different sources enables businesses, organizations, and government agencies to perform better data analysis and make better decisions.
In this paper, we present Fusion, a protocol that enables multiple data providers to engage in a privacy-preserving mashup process to generate anonymous mashup data with high information utility for data mining tasks such as classification analysis. Throughout the mashup process, a score function needs to be computed between the parties to guide the process. Therefore, we propose a secure protocol for evaluating the score function in a distributed setting. Figure 1 presents an example of a distributed environment for privacy-preserving data mashup.

[Figure 1: Privacy-preserving distributed data mashup. Multiple data providers, each owning a different set of attributes (e.g., ID/Class/Sex; ID/Sensitive/Salary; ID/Age/Education), engage in secure data integration to produce anonymized data for data mining tasks such as classification and clustering.]

The challenges of mashing up data from different data providers in a privacy-preserving manner are summarized as follows. The major challenge is privacy: data providers are often reluctant to share their data due to privacy concerns. We distinguish between two types of concerns. The first is to allow data providers to evaluate functions on the collective data while ensuring that no party learns more about the other parties' data than what is revealed in the final mashup data.

Example I.1. Consider the data in Table I, where three data providers, P1, P2, and P3, own different sets of attributes about the same individuals, and P2 owns the Class attribute. Assume that the parties are building a classifier and need to compute the information gain [1] for each attribute. P2 can directly compute the information gain for attribute Sex since it knows the class values. However, P1 and P3 should be able to compute the information gain for each of their attributes while the class values remain private (known only to P2).
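To make Example I.1 concrete, the following is a minimal sketch of the (non-secure, single-party) information gain computation that P2 can perform locally for an attribute such as Sex, since it holds the class values. The data values are illustrative, not taken from Table I; the distributed setting of Example I.1 would replace the direct access to class labels with the secure score-evaluation protocol proposed in this paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(C) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(attribute_values, class_labels):
    """Info gain of an attribute: H(C) minus the weighted
    entropy of the class labels within each attribute value."""
    total = len(class_labels)
    groups = {}
    for a, c in zip(attribute_values, class_labels):
        groups.setdefault(a, []).append(c)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(class_labels) - remainder

# Illustrative records only (hypothetical, not from Table I):
sex = ["M", "F", "M", "M", "F", "F"]
cls = ["Y", "N", "Y", "Y", "N", "Y"]
print(round(information_gain(sex, cls), 3))  # → 0.459
```

For attributes owned by P1 or P3, the per-value class counts needed inside `information_gain` are exactly what must be obtained without revealing P2's individual class values, which is the role of the secure protocol.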
The second concern is to ensure that the final mashup data is anonymized such that potential linkage attacks are thwarted. The adversary can perform two types of linkage attacks: record linkage, where an individual can be linked to a record if the record is very specific, and attribute linkage, where a frequent sensitive value can be inferred about an individual.

Example I.2. In Table I, if the adversary knows ⟨44, 12th, Female⟩ about an individual, then the adversary can link the individual to record #7 and sensitive value s2. On the other hand, if the adversary knows ⟨Bachelor, Male⟩, then he infers sensitive value s2 about