Efficient Discovery of Authoritative Resources
Ravi Kumar #1, Kevin Lang #1, Cameron Marlow #2, Andrew Tomkins #1
# Yahoo! Research, 701 First Ave., Sunnyvale, CA 94089, USA
1 {ravikumar,langk,atomkins}@yahoo-inc.com
2 cameron@media.mit.edu
Abstract. Given a dynamic corpus whose content and attention are changing on a daily basis, is it possible to collect and maintain the high-quality resources with a minimal investment? We address two problems that arise from this question for hyperlinked corpora such as web pages or blogs: how to efficiently discover the correct set of authoritative resources given a fixed network, and how to track these resources over time as new entrants arrive, old standbys depart, and existing participants change roles.
I. INTRODUCTION
We study the feasibility of gathering and maintaining highly authoritative content, at a scale several orders of magnitude smaller than the entire corpus. This approach is especially appropriate when the long tail of little-accessed content does not provide sufficient value to justify gathering and maintaining it.
The abstraction we consider is the following: given a dynamic corpus where content and attention are changing on a daily basis, is it possible to collect and maintain the high-quality resources with a minimal investment? We address two problems arising from this question, in the context of a hyperlinked corpus such as web pages or blogs.
Static: How to efficiently discover the correct set of author-
itative resources given a fixed network?
Dynamic: How to track authoritative resources over time
as new entrants arrive, old standbys depart, and existing
participants change roles?
More specifically, we focus on a directed graph setting,
where the resources are described by nodes and hyperlinks
by edges. We consider various models of accessing this graph,
namely, sampling nodes at random, performing a random walk,
or crawling. Our measurement of authoritativeness is indegree.
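As an illustration of the random-walk access model, a minimal sketch is given below (the function name, restart probability, and toy graph are our assumptions for illustration, not details from the paper): a walker repeatedly follows outlinks, so nodes with many inlinks tend to accumulate the most visits, making visit counts a cheap proxy for indegree-based authoritativeness.

```python
import random
from collections import Counter

def random_walk_discovery(out_edges, start, steps, restart_prob=0.15, seed=0):
    """Walk the directed graph by following outlinks, counting how often
    each node is reached. High-indegree nodes are reachable from many
    places, so they tend to dominate the visit counts."""
    rng = random.Random(seed)
    visits = Counter()
    node = start
    for _ in range(steps):
        succ = out_edges.get(node, [])
        if not succ or rng.random() < restart_prob:
            node = start  # restart on dead ends, or with small probability
        else:
            node = rng.choice(succ)
        visits[node] += 1
    return visits

# Toy graph: node 0 is linked to by every other node (indegree 5),
# so it should dominate the visit counts.
out_edges = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
counts = random_walk_discovery(out_edges, start=1, steps=10000)
top = counts.most_common(1)[0][0]
```

In a real crawler the successor lists would be fetched lazily as pages are visited, which is what makes this access model cheaper than materializing the whole graph.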
In the static setting, we propose a spectrum of algorithms,
ranging from idealistic to realistic. We conduct extensive
experiments on a blog data set and show that our realistic al-
gorithms are very effective in identifying authoritative sources.
Furthermore, they are practical and are highly competitive with
their idealistic counterparts in terms of performance. Next,
we turn to the dynamic setting, where the graph changes
over time. In this setting we investigate the algorithms from
the static case, augmented with several recrawl policies. Our
experiments on another blog data set, continuously crawled
over time, once again show that it is possible to devise
practical algorithms that are effective at tracking authoritative
nodes in this graph.
Related work. Web crawling is a well-studied topic, and can
be roughly subdivided into the problems of the discovery
of new content [1] and refresh of updated content from
previously visited resources. In the process of exploring the
web, different strategies have been applied to the problem of
ordering unvisited content; some have focused on the relative
importance of topics [2], [3] while others have used the
perceived rank [4], [5]. To effectively refresh updated pages
some have modeled the change rates of individual pages on
the web [6], [7], [8] while others have seen this as a problem
of keeping a given set of content up-to-date [9], [10], [11],
[12]. These approaches aim to characterize the web at large,
attending to comprehensiveness and coverage; we, instead,
focus on finding and tracking the most important content.
The process of identifying top resources in a web context
is also analogous to the task of identifying and threading
together important topics in a stream of information. Much
attention has been paid to this problem through iterative
improvements on a fixed corpus, via two sub-problems: topic
detection, or the identification of emerging, related concepts;
and event tracking, or tracking subsequent events given an
initial set [13], [14]. Similar efforts have also been made to
identify bursty events in streams of data [15].
II. DISCOVERY IN STATIC GRAPHS
In this section we discuss the basic problem of discovering
authoritative nodes in a fixed graph. We present several algo-
rithms for this problem and analyze the performance of our
algorithms on a large blog data set.
Problem. Let G = (V, E) be a directed graph with node set V and edge set E = {(u, v) | u, v ∈ V}. The inlinks of a node u are the edges {(v, u) | (v, u) ∈ E}; the size of this set is the indegree of u, denoted id(u). The outlinks of a node u are the edges {(u, v) | (u, v) ∈ E}; the size of this set is the outdegree of u, denoted od(u). We define a node to be "authoritative" if it has a high indegree; we note that indegree is easy to interpret and compute, and it cannot be easily changed by the node itself.
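Following the definitions above, id(u) and od(u) can be computed directly from an edge list; a minimal sketch (the helper name is ours):

```python
from collections import Counter

def degrees(edges):
    """Given a directed edge list [(u, v), ...], return (indeg, outdeg)
    maps: indeg[u] counts inlinks (v, u), outdeg[u] counts outlinks (u, v)."""
    indeg = Counter(v for _, v in edges)
    outdeg = Counter(u for u, _ in edges)
    return indeg, outdeg

edges = [(1, 0), (2, 0), (3, 0), (0, 1)]
indeg, outdeg = degrees(edges)
# id(0) = 3, od(0) = 1
```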
The problem of finding high-quality nodes in a graph can be stated as follows: given a directed graph G and a target number k, find the set K of the k most authoritative nodes. Finding K exactly will be difficult, so instead we ask for an approximate superset C ⊆ V, with k ≤ |C| ≪ |V|, that maximizes the quantity |C ∩ K|/|K|, called the crawl performance. A stricter measure that might be more relevant in real applications is the rank performance, given by |K′ ∩ K|/|K|, where K′ ⊆ C with |K′| = k is the output of the algorithm.
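Under these definitions, the two measures can be sketched as follows (hypothetical helpers and toy sets, for illustration only):

```python
def crawl_performance(C, K):
    """|C ∩ K| / |K|: fraction of the true top-k set K present
    in the crawled superset C."""
    return len(set(C) & set(K)) / len(K)

def rank_performance(K_prime, K):
    """|K' ∩ K| / |K|, where K' ⊆ C with |K'| = k is the
    algorithm's final output."""
    assert len(K_prime) == len(K)
    return len(set(K_prime) & set(K)) / len(K)

K = {1, 2, 3, 4}        # true top-k authoritative nodes
C = {1, 2, 3, 7, 8, 9}  # crawled superset, k <= |C| << |V|
K_prime = {1, 2, 7, 8}  # the algorithm's k picks from C
# crawl_performance(C, K) = 0.75, rank_performance(K_prime, K) = 0.5
```

Rank performance is never larger than crawl performance, since K′ is drawn from C; it penalizes an algorithm that crawls the right nodes but fails to rank them at the top.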
978-1-4244-1837-4/08/$25.00 © 2008 IEEE    1495    ICDE 2008