Efficient Monitoring Algorithm for Fast News Alerts

Ka Cheung Sia, Junghoo Cho, and Hyun-Kyu Cho

Abstract—Recently, there has been a dramatic increase in the use of XML data to deliver information over the Web. Personal weblogs, news Web sites, and discussion forums are now publishing RSS feeds for their subscribers to retrieve new postings. As the popularity of personal weblogs and RSS feeds grows rapidly, RSS aggregation services and blog search engines have appeared, which try to provide a central access point for simpler access and discovery of new content from a large number of diverse RSS sources. In this paper, we study how RSS aggregation services should monitor the data sources to retrieve new content quickly using minimal resources and to provide their subscribers with fast news alerts. We believe that the change characteristics of RSS sources and the general user access behavior pose distinct requirements that make this task significantly different from the traditional index refresh problem for Web-search engines. Our studies on a collection of 10K RSS feeds reveal some general characteristics of the RSS feeds, and show that with proper resource allocation and scheduling the RSS aggregator provides news alerts significantly faster than the best existing approach.

Index Terms—Information Search and Retrieval, Online Information Services, Performance evaluation, User profiles and alert services

I. INTRODUCTION

Recently, there has been a dramatic increase in the use of XML data to deliver information over the Web. In particular, personal weblogs, news Web sites, and discussion forums are now delivering up-to-date postings to their subscribers using the RSS protocol [32]. To help users access new content in this RSS domain, a number of RSS aggregation services and blog search engines have appeared recently and are gaining popularity [2], [3], [33], [38].
Using these services, a user can either (1) specify the set of RSS sources that she is interested in, so that she is notified whenever new content appears at the sources (either through email or when she logs in to the service) or (2) conduct a keyword-based search to retrieve all content containing the keyword. Clearly, having a central access point makes it significantly simpler to discover and access new content from a large number of diverse RSS sources.

A. Challenges and contributions

In this paper, we investigate one of the important challenges in building an effective RSS aggregator: how can we minimize the delay between the publication of new content at a source and its appearance at the aggregator? Note that the aggregation can be done either at a desktop (e.g., RSS feed readers) or at a central server (e.g., Personalized Yahoo/Google homepage). While some of our developed techniques can be applied to desktop-based aggregation, in this paper we primarily focus on the server-based aggregation scenario.

Ka Cheung Sia and Junghoo Cho are with the Department of Computer Science, University of California, Los Angeles, CA 90095, USA. E-mail: {kcsia,cho}@cs.ucla.edu
Hyun-Kyu Cho is with the Fundamental Intelligent Robot Research Team of Intelligent Robot Research Division, ETRI, 161 Gajeong-dong, Yuseong-gu, Daejeon, 305-700, Korea. E-mail: hkcho@etri.re.kr
Manuscript received January 27, 2006; revised October 3, 2007; accepted January 11, 2007.

This problem is similar to the index refresh problem for Web-search engines [7], [9], [11], [13], [15], [30], [31], [40], but two important properties of the information in the RSS domain make this problem unique and interesting:

• The information in the RSS domain is often time sensitive. Most new RSS content is related to current world events, so its value and significance deteriorate rapidly as time passes.
An effective RSS aggregator, therefore, has to retrieve new content quickly and make it available to its users close to real time. This requirement is in contrast to general Web search engines, where the temporal requirement is not as strict. For example, it is often acceptable to index a new Web page within, say, a month of its creation for the majority of Web pages.

• For general search engines, users mainly focus on the quality of the returned pages and largely ignore (or do not care about) what is not returned [22], [24]. Based on this observation, researchers have mainly focused on improving the precision of the top-k results [30], and the page-refresh policies have also been designed to improve the freshness of the indexed pages. For RSS feeds, however, many users have a set of favorite sources and are particularly interested in reading the new content from these sources. Therefore, users do notice (and complain) if the new content from their favorite sources is missing from the aggregator.

As we will see later, the time-sensitivity of the RSS domain fundamentally changes how we should model the generation of new content in this domain and makes it necessary to design a new content-monitoring policy. In the rest of this paper, we investigate the problem of how we can effectively monitor and retrieve time-sensitive new content from the RSS domain as follows:

• In Section II, we describe a formal framework for this problem. In particular, we propose a periodic inhomogeneous Poisson process to model the generation of postings at the RSS feeds. We also propose to use the delay metric to evaluate the monitoring policies for RSS feeds.
• In Section III, we develop the optimal ways to retrieve new content from RSS feeds through a careful analysis of the proposed model and metric.
• In Section IV, we examine the general characteristics of the RSS feeds based on real RSS-feed data. We also evaluate the effectiveness of our retrieval policies