A Versatile Tool for Privacy-Enhanced Web Search

Avi Arampatzis, George Drosatos, and Pavlos S. Efraimidis

Department of Electrical and Computer Engineering
Democritus University of Thrace, Xanthi 67 100, Greece
{avi,gdrosato,pefraimi}@ee.duth.gr

Abstract. We consider the problem of privacy leaks suffered by Internet users when they perform web searches, and propose a framework to mitigate them. Our approach, which builds upon and improves recent work on search privacy, approximates the target search results by replacing the private user query with a set of blurred or scrambled queries. The results of the scrambled queries are then used to cover the original user interest. We model the problem theoretically, define a set of privacy objectives with respect to web search, and investigate the effectiveness of the proposed solution with a set of real queries on a large web collection. Experiments show substantial improvements in retrieval effectiveness over a baseline previously reported in the literature. Furthermore, the methods are more versatile, more predictable in their behavior, and applicable to a wider range of information needs, and the privacy they provide is more comprehensible to the end-user.

1 Introduction

In 2006, AOL released query-log data containing about 21 million web queries collected from about 650,000 users over three months [8]. To protect user privacy, each real IP address had been replaced with a random ID. Soon after the release, the first ‘anonymous’ user was identified from the data [2]. Interestingly, this identification was based solely on the queries attributed to an anonymous ID. Even though AOL withdrew the data a few days after the privacy breach, copies of the collection still circulate freely online. The incident only substantiated what was already known: web search can pose serious threats to the privacy of Internet users.

The incident has motivated much research on web-log anonymization and on solutions based on anonymized or encrypted connections, agents, obfuscation by random additional queries, and other techniques; for a recent extensive review of the literature, we refer the reader to [1]. There is an important reason why all the aforementioned methods alone might be inadequate: in all cases, the query is revealed in its clear form. Thus, such approaches would not hide the existence of the interest at the search engine’s end or from any sites in the network path. In addition, using anonymization tools or encryption weakens the user’s plausible deniability regarding the existence of a private search task. In other words, when a user employs the above technologies, the engine still knows that someone is looking for “lawyers for victims of child rape”, and the user cannot deny that she has a private search task, which may be the aforementioned one.

P. Serdyukov et al. (Eds.): ECIR 2013, LNCS 7814, pp. 368–379, 2013. © Springer-Verlag Berlin Heidelberg 2013

A way to achieve plausible deniability, called the query scrambler, was recently presented in [1]; it works as follows. Given a private query, generate a set of scrambled