Four Heuristics to Guide Structured Content Crawling
Jürgen Umbrich and Andreas Harth and Aidan Hogan and Stefan Decker
National University of Ireland, Galway
Digital Enterprise Research Institute
{firstname.lastname}@deri.org
Abstract
Search engines focusing on particular media types face
difficulties in discovering suitable URIs on the Web. Since
the engines are only interested in a small fraction of the
Web, a crawler should use heuristics to concentrate on that
fraction. To devise such a heuristic, we postulate four
hypotheses based on RFCs and W3C recommendations to find
cues for certain content types. Tests on a corpus of 22 million
files (793 GB content size) containing 630 million URIs show that
for the content types text, image, and application, the
recommendations are mostly being followed, while results for
audio and video are much less consistent. Our findings and
recommendations can be implemented as heuristics for
efficient discovery of structured content on the Web on top of
existing crawlers.
1 Introduction
While established search engines focus on hypertext
documents, in recent years a number of specialised search
engines have emerged that collect and integrate information
from files of particular media types. Seeqpod¹ and Blinkx²
offer search over audio and video files, Google Scholar³
and CiteSeer⁴ are digital libraries of printable documents,
Technorati⁵ provides real-time access to news-feeds, and
Seekda⁶ offers search capabilities for web services. A new
generation of search engines with powerful query
functionality is emerging, such as SWSE⁷, Swoogle⁸, Watson⁹,
Falconsearch¹⁰ and Sindice¹¹, which concentrate on semantic
web data. Common to all these specialised search engines is
that they rely only on documents of specialised media types.
¹ http://www.seeqpod.com/
² http://www.blinkx.com/
³ http://scholar.google.com/
⁴ http://citeseer.ist.psu.edu/
⁵ http://technorati.com/
⁶ http://seekda.com/
⁷ http://swse.org/
⁸ http://swoogle.umbc.edu/
⁹ http://watson.kmi.open.ac.uk/WatsonWUI/
A common issue for targeted search engines is how to
discover files of a certain media type on the Web [12].
Most of the search engines provide users with a form or
an API to encourage submissions of URIs. In addition, they
use general Web search engines, like Google, MSN or Yahoo,
to harvest URIs using APIs and query constructs like
filetype or originurlextension, but public APIs
are restricted by result size and invocation frequency.
Furthermore, the query constructs only enable searching for
a specified file extension instead of querying for the
media type of the URIs. Since these methods do not provide
enough URIs, all the search engines must invest extra effort
traversing the Web in order to find additional sources.
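The gap between a file extension and a media type can be illustrated with a minimal Python sketch. It is not part of the system described here: the function names and the example.org URIs are hypothetical, and the Content-Type values are supplied by hand rather than fetched from a server. The sketch guesses a type from the URI path alone, which is roughly the information an extension-based query can use, and compares it with a declared media type.

```python
import mimetypes
from typing import Optional
from urllib.parse import urlparse


def guess_type_from_extension(uri: str) -> Optional[str]:
    """Guess a media type using only the file extension in the URI path."""
    path = urlparse(uri).path  # drop scheme, host and query string
    media_type, _encoding = mimetypes.guess_type(path)
    return media_type


def extension_matches_header(uri: str, content_type_header: str) -> bool:
    """Compare the extension-based guess with a declared Content-Type value."""
    # Strip parameters such as "; charset=UTF-8" before comparing.
    declared = content_type_header.split(";", 1)[0].strip().lower()
    return guess_type_from_extension(uri) == declared


# Hypothetical URIs paired with hand-written Content-Type values (assumptions).
examples = [
    ("http://example.org/papers/icwe08.pdf", "application/pdf"),
    ("http://example.org/index.html", "text/html; charset=UTF-8"),
    ("http://example.org/get?file=report.pdf", "application/pdf"),
]
for uri, header in examples:
    print(uri, guess_type_from_extension(uri), extension_matches_header(uri, header))
```

The last URI has no extension in its path, so an extension-based lookup yields no type even though the declared media type is application/pdf.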
A naïve approach is to use a breadth-first crawler and
fetch the connected Web, starting from a seed set of URIs.
Search engines in 2005 indexed approximately 11.5 billion
documents [8]. In February 2008, the indexed Web
was estimated to consist of 45 billion documents¹². Given
that web-scale crawlers can fetch in the order of thousands
of pages per second [2], downloading the entire connected
Web would take years and require large amounts of CPU
power, network bandwidth and storage space. Also,
published studies showed that over 90% of the documents on
the Web are hypertext files [14] [7]. Therefore, a search
engine for a particular media type is focused only on a
small subset of the entire Web. Ideally a specialised crawler
would download only the relevant portion of the Web and
would save time and resources compared to the naïve
approach.
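The breadth-first baseline can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a real crawler: the Web is a hypothetical in-memory dictionary (TOY_WEB) mapping each URI to a media type and its outgoing links, and the target type application/rdf+xml is chosen only as an example. The point is that the crawler has to fetch every reachable document to find the few of the wanted type.

```python
from collections import deque

# Hypothetical in-memory "Web": URI -> (media type, outgoing links).
TOY_WEB = {
    "http://example.org/": ("text/html", ["http://example.org/a", "http://example.org/b"]),
    "http://example.org/a": ("text/html", ["http://example.org/song.mp3", "http://example.org/"]),
    "http://example.org/b": ("text/html", ["http://example.org/c"]),
    "http://example.org/c": ("text/html", ["http://example.org/foaf.rdf"]),
    "http://example.org/song.mp3": ("audio/mpeg", []),
    "http://example.org/foaf.rdf": ("application/rdf+xml", []),
}


def breadth_first_crawl(seeds, web):
    """Visit every URI reachable from the seeds in breadth-first order."""
    frontier = deque(seeds)  # FIFO queue gives breadth-first order
    seen = set(seeds)        # avoid queueing the same URI twice
    fetched = []
    while frontier:
        uri = frontier.popleft()
        if uri not in web:   # dangling link in the toy graph
            continue
        media_type, links = web[uri]
        fetched.append((uri, media_type))
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched


fetched = breadth_first_crawl(["http://example.org/"], TOY_WEB)
relevant = [uri for uri, media_type in fetched if media_type == "application/rdf+xml"]
print(f"fetched {len(fetched)} documents, {len(relevant)} of the target media type")
```

In this toy graph the crawler downloads all six documents to reach a single relevant one, which is the overhead a specialised crawler would try to avoid.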
Crawling for HTML documents of a particular topic
of interest is related to the problem addressed by focused
crawling strategies. Focused crawling, as described in [11],
[4] or [3], uses decision rules based on content analysis, link
structure and anchor text to keep the crawler focused on a
¹⁰ http://www.falconsearch.com/
¹¹ http://sindice.com/
¹² http://www.worldwidewebsize.com/
Eighth International Conference on Web Engineering
978-0-7695-3261-5/08 $25.00 © 2008 IEEE
DOI 10.1109/ICWE.2008.42