Copyright © 2010, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Chapter 7.12
Search Engine-Based Web Information Extraction
Gijs Geleijnse
Philips Research, The Netherlands
Jan Korst
Philips Research, The Netherlands
Abstract
In this chapter we discuss approaches to find, extract, and structure information from natural language texts on the Web. Such structured information can be expressed and shared using the standard Semantic Web languages and can hence be interpreted by machines. We focus on two tasks in Web information extraction: the first part addresses mining facts from the Web, while the second part presents an approach to collecting community-based meta-data. A search engine is used to retrieve potentially relevant texts, from which instances and relations are extracted. The proposed approaches are illustrated using various case studies, showing that we can reliably extract information from the Web using simple techniques.
Introduction
Suppose we are interested in ‘the countries where Burger King can be found’, ‘the Dutch cities with a university of technology’ or perhaps ‘the genre of the music of Miles Davis’. For such diverse factual information needs, the World Wide Web in general and a search engine in particular can provide a solution. Experienced users of search engines are able to construct queries that are likely to retrieve documents containing the desired information. However, current search engines retrieve Web pages, not the information itself¹. We have to search within the search results in order to acquire the information. Moreover, we make implicit use of our knowledge (e.g., of the language and the domain) to interpret the Web pages.
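To make the step of searching within the search results concrete, the following minimal sketch scans a handful of text snippets for terms that directly follow a query phrase. It is an illustration only, not the method developed in this chapter: the function name `extract_instances`, the query phrase, and the snippets are all hypothetical, and the snippets are made up rather than retrieved from an actual search engine.

```python
import re
from collections import Counter


def extract_instances(snippets, phrase):
    """Count capitalised terms that directly follow `phrase` in the snippets."""
    # One or more capitalised words, e.g. "Spain" or "New Zealand".
    pattern = re.compile(
        re.escape(phrase) + r"\s+((?:[A-Z][a-z]+)(?:\s[A-Z][a-z]+)*)"
    )
    counts = Counter()
    for text in snippets:
        for match in pattern.finditer(text):
            counts[match.group(1)] += 1
    return counts


# Made-up snippets standing in for search engine results (assumption).
SNIPPETS = [
    "Find Burger King restaurants in Spain and order online.",
    "A list of Burger King restaurants in New Zealand, sorted by city.",
    "Burger King restaurants in Spain are open late.",
    "Reviews of burger restaurants in town.",
]

# Hypothetical query phrase for the information need
# 'the countries where Burger King can be found'.
counts = extract_instances(SNIPPETS, "Burger King restaurants in")
for term, frequency in counts.most_common():
    print(term, frequency)
```

Counting how often each candidate term occurs gives a crude indication of which extracted terms are supported by several texts; a capitalised term is of course not guaranteed to be a country, which is exactly the kind of interpretation a human reader performs implicitly.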
DOI: 10.4018/978-1-60566-112-4.ch009