Revisiting Sequential Pattern Hiding to Enhance Utility

Aris Gkoulalas-Divanis
Information Analytics Lab
IBM Research – Zurich
Rüschlikon, Switzerland
agd@zurich.ibm.com

Grigorios Loukides
Health Information Privacy Lab
Vanderbilt University
Nashville, Tennessee, USA
g.loukides@vanderbilt.edu

ABSTRACT
Sequence datasets are encountered in a plethora of applications spanning from web usage analysis to healthcare studies and ubiquitous computing. Disseminating such datasets offers remarkable opportunities for discovering interesting knowledge patterns, but may lead to serious privacy violations if sensitive patterns, such as business secrets, are disclosed. In this work, we consider how to sanitize data to prevent the disclosure of sensitive patterns during sequential pattern mining, while ensuring that the nonsensitive patterns can still be discovered. First, we re-define the problem of sequential pattern hiding to capture the information loss incurred by sanitization in terms of both events’ modification (distortion) and lost nonsensitive knowledge patterns (side-effects). Second, we model sequences as graphs and propose two algorithms to solve the problem by operating on the graphs. The first algorithm attempts to sanitize data with minimal distortion, whereas the second focuses on reducing the side-effects. Extensive experiments show that our algorithms outperform the existing solution in terms of data distortion and side-effects and are more efficient.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications—Data Mining

General Terms
Algorithms, Security, Performance

Keywords
Data privacy, Knowledge hiding, Sequential pattern hiding

1. INTRODUCTION
Sequential data are increasingly collected to support numerous applications in which the sequentiality of events is of primary interest.
Examples of such data are web usage logs, which record web page accesses, and mobility data, which capture the location of mobile devices at different moments in time [16]. Clearly, sequential data offer remarkable opportunities for discovering interesting behavioral patterns that can be beneficial to a broad community of people. For example, mining user mobility data can reveal interesting patterns that aid traffic engineers and environmentalists in their decisions. Publishing sequential data for data mining purposes, however, may lead to serious privacy violations if sensitive knowledge patterns are discovered. For instance, the mining of knowledge patterns from mobility datasets may enable intrusive inferences regarding the habits of a portion of the population, or provide the means for unsolicited advertisement and user profiling. Similar concerns have also been raised in relation to medical data sharing [13, 21].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD’11, August 21–24, 2011, San Diego, California, USA. Copyright 2011 ACM 978-1-4503-0813-7/11/08 ...$10.00.

To address these concerns, knowledge hiding methods [5] are necessary. These methods conceal sensitive patterns that can otherwise be mined from published data, without seriously affecting the data and the nonsensitive interesting patterns. Clifton and Marks [12], following D. E. O’Leary [24], who first pointed out the privacy breaches that originate from data mining algorithms, indicated the need to consider data mining approaches under the prism of privacy preservation.
Since then, several methods have emerged to hide knowledge that appears in the form of frequent itemsets and related association rules [19, 27, 29], or classification rules [11, 23].

Unlike these works, this paper considers the problem of hiding sensitive knowledge that appears in the form of frequent sequences and can be disclosed through sequence pattern mining algorithms [6, 9]. Sequential pattern hiding is a challenging problem, because sequences have more complex semantics than itemsets, and it calls for efficient solutions that offer high utility. To our knowledge, only the work of [3, 4] attempts to address this problem, but it may fail to identify high-quality hiding solutions, as we discuss in Section 2.

Our work makes the following contributions:

• We re-define the problem of sequential pattern hiding to capture the utility of released data by considering both the side-effects and the distortion introduced by the hiding process. This allows the production of more useful data for the task they are disseminated for.
• We design two novel sequence hiding algorithms. The first algorithm aims to minimize data distortion, whereas the second focuses on ensuring that the nonsensitive interesting knowledge can still be discovered.
• We extensively evaluate our algorithms, demonstrating that they significantly outperform the existing solution