The Gist of Everything New: Personalized Top-k Processing over Web 2.0 Streams

Parisa Haghani (EPFL, Lausanne, Switzerland) parisa.haghani@epfl.ch
Sebastian Michel (Saarland University, Saarbrücken, Germany) smichel@mmci.uni-saarland.de
Karl Aberer (EPFL, Lausanne, Switzerland) karl.aberer@epfl.ch

ABSTRACT
Web 2.0 portals have made content generation easier than ever, with millions of users contributing news stories as posts in weblogs or as short textual snippets as in Twitter. Efficient and effective filtering solutions are key to allowing users to stay tuned to this ever-growing ocean of information, releasing only relevant trickles of personal interest. In classical information filtering systems, user interests are formulated using standard IR techniques, and data from all available information sources is filtered based on a predefined absolute quality-based threshold. In contrast to this restrictive approach, which may still overwhelm the user with the returned stream of data, we envision a system which continuously keeps the user updated with only the top-k relevant new information. Freshness of data is guaranteed by considering it valid for a particular time interval, controlled by a sliding window. Considering relevance as relative to the existing pool of new information creates a highly dynamic setting. We present POL-filter, which together with our maintenance module constitutes an efficient solution to this kind of problem. We show by comprehensive performance evaluations using real-world data, obtained from a weblog crawl, that our approach brings performance gains compared to the state of the art.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing methods; H.4.m [Information Systems]: Miscellaneous

General Terms
Algorithms, Experimentation

Keywords
Top-k Query, Information Filtering, Data Streams

This work is partially supported by NCCR-MICS (grant number 5005-67322), the FP7 EU Project OKKAM (contract no. ICT-215032), and the German Research Foundation (DFG) Cluster of Excellence "Multimodal Computing and Interaction" (MMCI).

CIKM'10, October 26–30, 2010, Toronto, Ontario, Canada.
Copyright 2010 ACM 978-1-4503-0099-5/10/10.

1. INTRODUCTION
The world has turned into one large-scale interconnected information system with millions of users. With the advent of Web 2.0, yesterday's end users are now content generators themselves and actively contribute to the Web. Each user action, for example uploading a picture, tagging a video, or commenting on a blog, can be interpreted as an event in a corresponding stream. Given the immense volume of this data and its vast diversity, there is a vital need for effective filtering methods which allow users to efficiently follow personally interesting information and stay tuned. Currently popular methods place the filter on the data sources: mechanisms such as RSS and Atom are used to notify users of newly published data on their favored weblogs or news portals.
However, with the currently available functionalities, users can only decide to be notified of new posts on certain blogs, or to follow certain other users as in Twitter. This limits the number of subscriptions users make, as otherwise the amount of received information would be overwhelming for human processing. On the other hand, traditional information filtering systems [5, 27] aggregate all available information sources and allow users to specify their interests as profiles. Given a similarity measure between the data and the profiles, only data which passes a certain quality-based threshold is returned to the user. Although this diversifies the returned results as opposed to the previous method, it can easily flood the user by returning too much data. Choosing a suitable threshold that avoids overwhelming the user without returning very few results is hard due to the ever-changing nature of incoming data. This calls for a system which deems relevance relative to the existing pool of information [21], as opposed to absolute relevance. Furthermore, to account for the desire to consume new information and to avoid repeatedly returning highly relevant but old information, data is considered valid for only a certain time interval, controlled by a sliding window. Note that in the context of Web 2.0, all information comes with explicit temporal annotations, e.g., written at or uploaded at, which makes it a natural item of a temporal stream; therefore the definition of a sliding window is meaningful. The dynamism introduced by considering relevance relative in a frequently changing information pool, as well as the scale of our envisioned system
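To make the setting concrete, the following is a minimal sketch of a per-user sliding-window top-k filter, not the paper's POL-filter algorithm. All names are illustrative, and a simple word-overlap score stands in for whatever IR similarity measure the profile uses; the point is that the reported top-k is always relative to the items currently valid in the window.

```python
from collections import deque

def score(profile_terms, text):
    """Illustrative relevance: word overlap between profile and item."""
    return len(set(text.lower().split()) & profile_terms)

def topk_over_window(stream, profile_terms, k, window):
    """After each arrival, yield the current top-k among items that
    are still valid, i.e. arrived within `window` time units."""
    buf = deque()  # (timestamp, text), ordered by arrival
    for ts, text in stream:
        # Expire items that slid out of the window.
        while buf and buf[0][0] <= ts - window:
            buf.popleft()
        buf.append((ts, text))
        ranked = sorted(buf, key=lambda it: score(profile_terms, it[1]),
                        reverse=True)
        yield ts, [t for _, t in ranked[:k]]

profile = {"stream", "filtering", "topk"}
posts = [
    (1, "new filtering method for streams"),
    (2, "cooking recipes and travel tips"),
    (3, "topk filtering over a stream of posts"),
    (9, "stream processing news"),
]
results = list(topk_over_window(posts, profile, k=2, window=5))
```

Note the dynamism the text describes: at time 9 the earlier, more relevant posts have expired, so the top-k is recomputed over only the fresh items; an efficient solution must avoid this naive full re-ranking on every arrival.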