Revisiting Sequential Pattern Hiding to Enhance Utility

Aris Gkoulalas-Divanis
Information Analytics Lab, IBM Research – Zurich, Rüschlikon, Switzerland
agd@zurich.ibm.com

Grigorios Loukides
Health Information Privacy Lab, Vanderbilt University, Nashville, Tennessee, USA
g.loukides@vanderbilt.edu

ABSTRACT
Sequence datasets are encountered in a plethora of applications spanning from web usage analysis to healthcare studies and ubiquitous computing. Disseminating such datasets offers remarkable opportunities for discovering interesting knowledge patterns, but may lead to serious privacy violations if sensitive patterns, such as business secrets, are disclosed. In this work, we consider how to sanitize data to prevent the disclosure of sensitive patterns during sequential pattern mining, while ensuring that the nonsensitive patterns can still be discovered. First, we re-define the problem of sequential pattern hiding to capture the information loss incurred by sanitization in terms of both events’ modification (distortion) and lost nonsensitive knowledge patterns (side-effects). Second, we model sequences as graphs and propose two algorithms to solve the problem by operating on the graphs. The first algorithm attempts to sanitize data with minimal distortion, whereas the second focuses on reducing the side-effects. Extensive experiments show that our algorithms outperform the existing solution in terms of data distortion and side-effects and are more efficient.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - Data Mining

General Terms
Algorithms, Security, Performance

Keywords
Data privacy, Knowledge hiding, Sequential pattern hiding

1. INTRODUCTION
Sequential data are increasingly collected to support numerous applications in which the sequentiality of events is of primary interest.
Examples of such data are web usage logs, which record web page accesses, or mobility data that capture the location of mobile devices at different moments in time [16]. Clearly, sequential data offer remarkable opportunities for discovering interesting behavioral patterns that can be beneficial to a broad community of people. For example, mining user mobility data can reveal interesting patterns that aid traffic engineers and environmentalists in their decisions. Publishing sequential data for data mining purposes, however, may lead to serious privacy violations if sensitive knowledge patterns are discovered. For instance, the mining of knowledge patterns from mobility datasets may enable intrusive inferences regarding the habits of a portion of the population, or provide the means for unsolicited advertisement and user profiling. Similar concerns have also been raised in relation to medical data sharing [13, 21].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD’11, August 21–24, 2011, San Diego, California, USA. Copyright 2011 ACM 978-1-4503-0813-7/11/08 ...$10.00.

To address these concerns, knowledge hiding methods [5] are necessary. These methods conceal sensitive patterns that can otherwise be mined from published data, without seriously affecting the data and the nonsensitive interesting patterns. Clifton and Marks [12], following D. E. O’Leary [24], who first pointed out the privacy breaches that originate from data mining algorithms, indicated the need to consider data mining approaches under the prism of privacy preservation.
Since then, several methods have emerged to hide knowledge that appears in the form of frequent itemsets and related association rules [19, 27, 29], or classification rules [11, 23].

Unlike these works, this paper considers the problem of hiding sensitive knowledge that appears in the form of frequent sequences and can be disclosed through sequential pattern mining algorithms [6, 9]. Sequential pattern hiding is a challenging problem, because sequences have more complex semantics than itemsets, and it calls for efficient solutions that offer high utility. To our knowledge, only the work of [3, 4] attempts to address this problem, but it may fail to identify high-quality hiding solutions, as we discuss in Section 2.

Our work makes the following contributions:

• We re-define the problem of sequential pattern hiding to capture the utility of released data by considering both the side-effects and the distortion introduced by the hiding process. This yields data that are more useful for the task for which they are disseminated.

• We design two novel sequence hiding algorithms. The first algorithm aims to minimize data distortion, whereas the second focuses on ensuring that the nonsensitive interesting knowledge can still be discovered.

• We extensively evaluate our algorithms, demonstrating that they significantly outperform the existing solution