Privacy Preserving Data Portals
Benjamin C. M. Fung
Simon Fraser University, Canada

Copyright © 2007, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

INTRODUCTION

Information in a Web portal is often an integration of data collected from multiple sources. A typical example is the concept of one-stop service: a single health portal provides a patient with all of her or his health history, doctor's information, test results, appointment bookings, insurance, and health reports. This concept involves information sharing among multiple parties, for example, a hospital, a drug store, and an insurance company. The general public, however, has growing concerns about the use of personal information. Samarati (2001) shows that linking two data sources may unexpectedly reveal sensitive information about individuals. In response, new privacy acts are enforced in many countries. For example, Canada launched the Personal Information Protection and Electronic Documents Act in 2001 to protect a wide spectrum of information (The House of Commons in Canada, 2000). Consequently, companies cannot indiscriminately share their private information with other parties.

A data portal provides a single access point for Web clients to retrieve data. It also serves as a logical point at which to determine the trade-off between information sharing and privacy protection. Can the two goals be achieved simultaneously? This chapter formalizes this question as a problem called secure portals integration for classification and presents a solution for it.

Consider the model in Figure 1. A hospital A and an insurance company B own different sets of attributes about the same set of individuals identified by a common key.
They want to share their data via their data portals and present an integrated version in a Web portal to support decision making, such as credit limit or insurance policy approval, while satisfying two privacy requirements:

1. The final integrated table has to satisfy the k-anonymity requirement: given a specified set of attributes called a quasi-identifier (QID), each value of the QID that occurs in the integrated table must be shared by at least k records in that table (Dalenius, 1986).
2. During the process of generalization, no party can learn from another party any information more detailed than what appears in the final integrated table.

Simply joining their data at the raw level (e.g., birthday and city) may violate the k-anonymity requirement. Therefore, the data portals have to cooperate to determine a generalized version of the integrated data (e.g., birth year and province) such that the generalized table remains useful for classification analysis, such as insurance plan approval. Let us first review some building blocks in the literature. Then we elaborate on an algorithm, called top-down specialization for 2-party (Wang, Fung, & Dong, 2005), that addresses the problem.

BACKGROUND

Privacy-preserving data mining is the study of performing a data-mining task, such as classification, association, or clustering, without violating some given privacy requirement. Recently, this topic has gained enormous attention

Figure 1. Secure portals integration for classification. (Diagram labels: Party A, Party B, Private DB, Data, Generalized data, Integrated Web Portal (for classification analysis).)
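The k-anonymity requirement stated in the introduction can be checked mechanically once an integrated table exists. The following is a minimal Python sketch of that check only, not the top-down specialization algorithm and not a secure two-party protocol: it assumes a single process can see the whole joined table, which the second privacy requirement would not allow in practice. The records, the attribute names, the `PROVINCE_OF` lookup, and the helper functions are all hypothetical and chosen only to mirror the birthday/city versus birth year/province example.

```python
from collections import Counter

# Hypothetical city-to-province lookup, used only for this illustration.
PROVINCE_OF = {"Vancouver": "BC", "Burnaby": "BC", "Toronto": "ON", "Ottawa": "ON"}


def is_k_anonymous(records, qid, k):
    """Return True if every QID value occurring in records is shared by >= k records."""
    group_sizes = Counter(tuple(record[attr] for attr in qid) for record in records)
    return all(size >= k for size in group_sizes.values())


def generalize(record):
    """Generalize birthday to birth year and city to province (assumed hierarchy)."""
    return {
        "birth_year": record["birthday"][:4],
        "province": PROVINCE_OF[record["city"]],
        "plan": record["plan"],
    }


# Hypothetical integrated table at the raw level.
raw = [
    {"birthday": "1970-03-12", "city": "Vancouver", "plan": "approved"},
    {"birthday": "1970-08-30", "city": "Burnaby", "plan": "approved"},
    {"birthday": "1982-01-05", "city": "Toronto", "plan": "rejected"},
    {"birthday": "1982-11-21", "city": "Ottawa", "plan": "approved"},
]
generalized = [generalize(record) for record in raw]

print(is_k_anonymous(raw, ["birthday", "city"], 2))                 # False
print(is_k_anonymous(generalized, ["birth_year", "province"], 2))   # True
```

In this toy table every raw (birthday, city) pair is unique, so the raw join fails the check for k = 2, while the generalized (birth year, province) pairs each cover two records and pass.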