Four Heuristics to Guide Structured Content Crawling

Jürgen Umbrich and Andreas Harth and Aidan Hogan and Stefan Decker
National University of Ireland, Galway
Digital Enterprise Research Institute
{firstname.lastname@deri.org}

Abstract

Search engines focusing on particular media types face difficulties in discovering suitable URIs on the Web. Since the engines are only interested in a small fraction of the Web, a crawler should use heuristics to concentrate on that fraction. To devise such a heuristic, we postulate four hypotheses based on RFCs and W3C recommendations to find cues for certain content types. Tests on a corpus of 22m files (793GB content size) containing 630m URIs show that for the content types text, image, and application, the recommendations are mostly being followed, while results for audio and video are much less consistent. Our findings and recommendations can be implemented as heuristics for efficient discovery of structured content on the Web on top of existing crawlers.

1 Introduction

While established search engines focus on hypertext documents, in recent years a number of specialised search engines have emerged that collect and integrate information from files of particular media types. Seeqpod¹ and Blinkx² offer search over audio and video files, Google Scholar³ and CiteSeer⁴ are digital libraries of printable documents, Technorati⁵ provides real-time access to news feeds, and Seekda⁶ offers search capabilities for web services. A new generation of search engines with powerful query functionality is emerging, such as SWSE⁷, Swoogle⁸, Watson⁹, Falconsearch¹⁰ and Sindice¹¹, which concentrate on semantic web data.

1 http://www.seeqpod.com/
2 http://www.blinkx.com/
3 http://scholar.google.com/
4 http://citeseer.ist.psu.edu/
5 http://technorati.com/
6 http://seekda.com/
7 http://swse.org/
8 http://swoogle.umbc.edu/
9 http://watson.kmi.open.ac.uk/WatsonWUI/
Common to all these specialised search engines is that they rely only on documents of specialised media types. A common issue for targeted search engines is how to discover files of a certain media type on the Web [12]. Most of the search engines provide users with a form or an API to encourage submissions of URIs. In addition, they use general Web search engines, like Google, MSN or Yahoo, to harvest URIs using APIs and query constructs like filetype or originurlextension, but public APIs are restricted by result size and invocation frequency. Furthermore, the query constructs only enable searching for a specified file extension instead of querying for the media type of the URIs. Since these methods do not provide enough URIs, all the search engines must invest extra effort traversing the Web in order to find additional sources.

A naïve approach is to use a breadth-first crawler and fetch the connected Web, starting from a seed set of URIs. Search engines in 2005 indexed approximately 11.5 billion documents [8]. In February 2008, the indexed Web was estimated to consist of 45 billion documents¹². Given that web-scale crawlers can fetch in the order of thousands of pages per second [2], downloading the entire connected Web would take years and require large amounts of CPU power, network bandwidth and storage space. Also, published studies showed that over 90% of the documents on the Web are hypertext files [14] [7]. Therefore, a search engine for a particular media type is focused only on a small subset of the entire Web. Ideally, a specialised crawler would download only the relevant portion of the Web and would save time and resources compared to the naïve approach.

Crawling for HTML documents of a particular topic of interest is related to the problem addressed by focused crawling strategies.
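To make the cost of the naïve breadth-first approach described above concrete, the following minimal sketch runs a breadth-first traversal over a tiny in-memory link graph and counts how many fetched documents actually have the wanted media type. It is an illustration only, not the crawler used in this paper: the toy graph `TOY_WEB`, its example.org URIs, the assigned media types and the function name `breadth_first_crawl` are all assumptions made for this example, and no network access takes place.

```python
from collections import deque

# Hypothetical toy Web (an assumption for illustration):
# URI -> (media type, outgoing links).
TOY_WEB = {
    "http://example.org/": (
        "text/html",
        ["http://example.org/a.html", "http://example.org/b.html"],
    ),
    "http://example.org/a.html": (
        "text/html",
        ["http://example.org/song.mp3", "http://example.org/"],
    ),
    "http://example.org/b.html": ("text/html", ["http://example.org/c.html"]),
    "http://example.org/c.html": ("text/html", ["http://example.org/foaf.rdf"]),
    "http://example.org/song.mp3": ("audio/mpeg", []),
    "http://example.org/foaf.rdf": ("application/rdf+xml", []),
}


def breadth_first_crawl(seeds, web, wanted_type):
    """Traverse `web` breadth-first from `seeds`.

    Returns (fetched, relevant): every URI that was fetched, and the
    subset of those whose media type equals `wanted_type`.
    """
    frontier = deque(seeds)   # FIFO queue gives breadth-first order
    seen = set(seeds)         # avoid fetching the same URI twice
    fetched, relevant = [], []
    while frontier:
        uri = frontier.popleft()
        if uri not in web:    # dangling link: nothing to fetch
            continue
        media_type, links = web[uri]
        fetched.append(uri)
        if media_type == wanted_type:
            relevant.append(uri)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched, relevant


fetched, relevant = breadth_first_crawl(
    ["http://example.org/"], TOY_WEB, "application/rdf+xml"
)
print(len(fetched), len(relevant))
```

In this toy graph the naïve crawler fetches all six documents to find the single one of the wanted type, which mirrors, at a very small scale, why a specialised crawler would prefer to decide before fetching whether a URI is likely to be relevant.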
Focused crawling, as described in [11], [4] or [3], uses decision rules based on content analysis, link structure and anchor text to keep the crawler focused on a

10 http://www.falconsearch.com/
11 http://sindice.com/
12 http://www.worldwidewebsize.com/

Eighth International Conference on Web Engineering
978-0-7695-3261-5/08 $25.00 © 2008 IEEE
DOI 10.1109/ICWE.2008.42
196