Optimal tag suppression for privacy protection in the semantic Web
Javier Parra-Arnau ⁎, David Rebollo-Monedero, Jordi Forné, Jose L. Muñoz, Oscar Esparza
Dept. of Telematics Engineering, Universitat Politècnica de Catalunya, C. Jordi Girona 1-3, E-08034 Barcelona, Spain
Article info
Article history:
Received 10 December 2010
Received in revised form 24 July 2012
Accepted 28 July 2012
Available online 25 August 2012
Abstract
Leveraging the principle of data minimization, we propose tag suppression, a privacy-enhancing
technique for the semantic Web. In our approach, users tag resources on the Web,
revealing their personal preferences. However, to prevent privacy attackers from
profiling them based on their interests, users may wish to refrain from tagging certain resources.
Consequently, tag suppression protects user privacy to a certain extent, but at the cost of semantic
loss incurred by suppressing tags. In a nutshell, our technique poses a trade-off between privacy
and suppression. In this paper, we investigate this trade-off in a mathematically systematic
fashion and provide an extensive theoretical analysis. We measure user privacy as the entropy of
the user's tag distribution after the suppression of some tags. Equipped with a quantitative
measure of both privacy and utility, we find a closed-form solution to the problem of optimal tag
suppression. Experimental results on a real-world tagging application show how our approach
may contribute to privacy protection.
© 2012 Elsevier B.V. All rights reserved.
Keywords:
Information privacy
Privacy-enhancing technology
Shannon entropy
Privacy–suppression trade-off
Semantic Web
Tagging systems
1. Introduction
The World Wide Web constitutes the largest repository of information in the world. Since its invention in the nineties, the way in
which information is organized has evolved substantially. Initially, Web content was classified in directories belonging to
different areas of interest, manually maintained by experts. These directories provided users with accurate information, but as the
Web grew they rapidly became unmanageable. Although they are still available, they have been progressively overshadowed by
search engines based on Web crawlers, which explore new or updated content in a methodical, automatic manner. However,
even though search engines are able to index a large amount of Web content, they may return irrelevant results or fail when the query
terms do not appear explicitly in Web pages. A query containing the keyword accommodation, for instance, would not retrieve pages that
contain terms such as hotel or apartment but not that keyword.
Recently, a new conception of the Web, called the semantic Web [1], has emerged to address this problem. The semantic
Web, envisioned by Tim Berners-Lee in 2001, is expected to provide Web content with a conceptual structure so that information
can be interpreted by machines. For this to become a reality, the semantic Web requires that meaning be explicitly associated with
resources on the Web. A widespread way to accomplish this is by means of semantic tagging.
One of the major benefits of associating concepts with Web pages is clearly semantic interoperability among Web applications.
In addition, tagging will allow these applications to reduce their interaction with users, to obtain some form of semantic distance
between pages and, ultimately, to process pages whose content is nowadays understandable only by humans. In a nutshell, the
semantic Web lays the foundation for a future scenario where intelligent software agents will be able to automatically book flights
for us, update our medical records at our request and provide us with personalized answers to particular queries, without the
hassle of exhaustive literal searches across myriads of disorganized data [2]. In the meantime, we can enjoy some instances,
although limited in scope, of this new conception of the Web, namely the tagging systems that have proliferated over the last
Data & Knowledge Engineering 81–82 (2012) 46–66
⁎ Corresponding author. Tel.: +34 93 401 7027.
E-mail addresses: javier.parra@entel.upc.edu (J. Parra-Arnau), david.rebollo@entel.upc.edu (D. Rebollo-Monedero), jforne@entel.upc.edu (J. Forné),
jose.munoz@entel.upc.edu (J.L. Muñoz), oscar.esparza@entel.upc.edu (O. Esparza).
0169-023X/$ – see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.datak.2012.07.004