Optimal tag suppression for privacy protection in the semantic Web

Javier Parra-Arnau ⁎, David Rebollo-Monedero, Jordi Forné, Jose L. Muñoz, Oscar Esparza

Dept. of Telematics Engineering, Universitat Politècnica de Catalunya, C. Jordi Girona 1-3, E-08034 Barcelona, Spain

Article history: Received 10 December 2010. Received in revised form 24 July 2012. Accepted 28 July 2012. Available online 25 August 2012.

Keywords: Information privacy; Privacy-enhancing technology; Shannon entropy; Privacy–suppression trade-off; Semantic Web; Tagging systems

Abstract

Leveraging the principle of data minimization, we propose tag suppression, a privacy-enhancing technique for the semantic Web. In our approach, users tag resources on the Web, thereby revealing their personal preferences. However, to prevent privacy attackers from profiling them based on their interests, users may wish to refrain from tagging certain resources. Consequently, tag suppression protects user privacy to a certain extent, but at the cost of the semantic loss incurred by suppressing tags. In a nutshell, our technique poses a trade-off between privacy and suppression. In this paper, we investigate this trade-off in a mathematically systematic fashion and provide an extensive theoretical analysis. We measure user privacy as the entropy of the user's tag distribution after the suppression of some tags. Equipped with a quantitative measure of both privacy and utility, we find a closed-form solution to the problem of optimal tag suppression. Experimental results on a real-world tagging application show how our approach may contribute to privacy protection.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

The World Wide Web constitutes the largest repository of information in the world. Since its invention in the nineties, the form in which information is organized has evolved substantially.
At the beginning, Web content was classified in directories belonging to different areas of interest, manually maintained by experts. These directories provided users with accurate information, but as the Web grew they rapidly became unmanageable. Although they are still available, they have been progressively displaced by current search engines based on Web crawlers, which explore new or updated content in a methodical, automatic manner. However, even though search engines are able to index a large amount of Web content, they may return irrelevant results or fail when the query terms are not explicitly included in Web pages. A query containing the keyword "accommodation", for instance, would not retrieve pages that mention terms such as "hotel" or "apartment" but do not contain that keyword.

Recently, a new way of conceiving the Web, called the semantic Web [1], has emerged to address this problem. The semantic Web, envisioned by Tim Berners-Lee in 2001, is expected to provide Web content with a conceptual structure so that information can be interpreted by machines. For this to become a reality, the semantic Web requires meaning to be explicitly associated with resources on the Web. A widespread way to accomplish this is by means of semantic tagging. One of the major benefits of associating concepts with Web pages is clearly semantic interoperability across Web applications. In addition, tagging will allow these applications to reduce their interaction with users, to obtain some form of semantic distance between pages and, ultimately, to process pages whose content is nowadays understandable only by humans. In a nutshell, the semantic Web lays the foundation for a future scenario where intelligent software agents will be able to automatically book flights for us, update our medical records at our request and provide us with personalized answers to particular queries, without the hassle of exhaustive literal searches across myriads of disorganized data [2].
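The "accommodation" example above can be made concrete with a toy sketch. It contrasts a literal keyword match over page text with a lookup over semantic tags attached to each page. The corpus, the page names, the tags and the two function names are all hypothetical and chosen only for illustration; they are not taken from any real search engine or tagging system.

```python
# Toy corpus (hypothetical): each page has free text and a set of semantic tags.
PAGES = {
    "page_a": {"text": "Cheap hotel rooms near the beach",
               "tags": {"accommodation", "travel"}},
    "page_b": {"text": "Apartment for rent in Barcelona",
               "tags": {"accommodation", "housing"}},
    "page_c": {"text": "Accommodation guide for students",
               "tags": {"accommodation", "education"}},
    "page_d": {"text": "Flight deals to Spain",
               "tags": {"travel"}},
}


def literal_search(keyword, pages):
    """Return the names of pages whose text literally contains the keyword."""
    kw = keyword.lower()
    return sorted(name for name, page in pages.items()
                  if kw in page["text"].lower())


def tag_search(concept, pages):
    """Return the names of pages that carry the given semantic tag."""
    tag = concept.lower()
    return sorted(name for name, page in pages.items()
                  if tag in page["tags"])


if __name__ == "__main__":
    # Literal matching misses the hotel and apartment pages.
    print(literal_search("accommodation", PAGES))
    # Tag-based lookup also retrieves pages that never use the word itself.
    print(tag_search("accommodation", PAGES))
```

In this sketch, the literal search finds only the page whose text contains the word, whereas the tag lookup also returns the hotel and apartment pages, because the concept was attached to them explicitly.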
Data & Knowledge Engineering 81–82 (2012) 46–66
⁎ Corresponding author. Tel.: +34 93 401 7027.
E-mail addresses: javier.parra@entel.upc.edu (J. Parra-Arnau), david.rebollo@entel.upc.edu (D. Rebollo-Monedero), jforne@entel.upc.edu (J. Forné), jose.munoz@entel.upc.edu (J.L. Muñoz), oscar.esparza@entel.upc.edu (O. Esparza).
0169-023X/$ – see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.datak.2012.07.004

In the meantime, we can enjoy some instances, although limited in scope, of this new conception of the Web, namely the tagging systems that have proliferated over the last