JULY/AUGUST 2004 43
Although we can think of the Web as a huge semistructured database that provides us with a vast amount of information, no one knows exactly how many Web pages are out there. Google reports more than 3.3 billion textual documents indexed up to September 2003, yet a query for the word "the" that same month returned at least 5.2 billion documents in Google's listings (www.webmasterworld.com/forum3/16779.htm). We can assume that many additional documents and Web pages, perhaps in other languages, do not contain the word "the."
Most people believe they can easily find the information they're looking for on the Web. They simply browse from the prelisted entry points in hierarchical directories (like yahoo.com) or start with a list of keywords in a search engine. However, many Web information services deliver inconsistent, inaccurate, incomplete, and often irrelevant results.
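Keyword search of this kind is typically built on an inverted index. The sketch below (Python, over a hypothetical three-document collection; the documents and function names are illustrative, not from the article) shows one reason results can be incomplete: with conjunctive matching, a page is returned only if every query term occurs in it literally.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Conjunctive (AND) keyword search: every query term must occur."""
    results = None
    for term in query.lower().split():
        postings = index.get(term, set())
        results = postings if results is None else results & postings
    return results or set()

docs = {
    "d1": "web search engines crawl and index pages",
    "d2": "hierarchical directories list entry points",
    "d3": "search engines index the surface web",
}
index = build_inverted_index(docs)
print(sorted(search(index, "search index")))  # ['d1', 'd3']
print(sorted(search(index, "hidden web")))    # [] - no page has both terms
```

A document about the same topic that happens to use different vocabulary is simply missed, which is one source of the incompleteness noted above.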
For many reasons, existing Web search techniques have significant deficiencies with respect to robustness, flexibility, and precision. For example, although general search engines crawl and index thousands of Web pages (the so-called surface Web), they typically ignore valuable pages that require authorization or prior registration, the ones whose contents are not directly available for crawling through links. This is the hidden (or deep or invisible) Web. Public information on the hidden Web is currently estimated to be 400 to 550 times larger than the surface Web [1].
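Why a link-following crawler never sees such pages can be sketched with a toy breadth-first crawler over an in-memory site graph (the `SITE` structure and its `requires_auth` flag are illustrative assumptions standing in for real HTTP fetching, not a real crawler API):

```python
from collections import deque

# Hypothetical site graph: page -> (requires_auth, outgoing links).
SITE = {
    "/":               (False, ["/products", "/members"]),
    "/products":       (False, ["/products/p1"]),
    "/products/p1":    (False, []),
    "/members":        (True,  ["/members/report"]),  # login wall
    "/members/report": (False, []),                   # hidden-Web content
}

def crawl(start):
    """Breadth-first crawl that indexes only publicly fetchable pages."""
    indexed, frontier, seen = [], deque([start]), {start}
    while frontier:
        page = frontier.popleft()
        requires_auth, links = SITE[page]
        if requires_auth:
            continue  # cannot fetch past the login form, so its links die here
        indexed.append(page)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return indexed

print(crawl("/"))  # ['/', '/products', '/products/p1']
```

The report behind the login wall is itself public once reached, but because the only path to it runs through a page the crawler cannot fetch, it never enters the index.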
Another unpleasant feature of the Web is its volatility. Web documents typically undergo two kinds of change. The first, persistence, is the existence or disappearance of Web pages and sites during a Web document's life cycle. According to one study [2], a Web page's "half-life" seems to be somewhat less than two years, with a Web site's half-life being somewhat more than two years. The second type of change is page or site content modification. Another study [3] notes that 23 percent of all Web pages change daily (40 percent of commercial Web pages change daily); it also reports a half-life of 10 days for commercial Web pages. Some pages disappear completely, though, which means the data gathered by a search engine can quickly become stale or out of date. Crawlers must regularly revisit Web pages to maintain the freshness of the search engine's data.
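A half-life suggests an exponential model of change. Assuming that model holds (an assumption for illustration; the cited studies report only the half-lives themselves), the expected fraction of pages that have changed after a given interval follows directly, shown here with the 10-day figure reported for commercial pages:

```python
def fraction_changed(days, half_life_days):
    """Expected fraction of pages changed at least once after `days`,
    assuming exponential change with the observed half-life."""
    return 1.0 - 0.5 ** (days / half_life_days)

# Reported half-life of 10 days for commercial Web pages:
for t in (1, 5, 10, 30):
    print(f"after {t:2d} days: {fraction_changed(t, 10):.1%} changed")
# At t = 10 days, by definition, half the pages are expected to have changed.
```

Under this model a crawler revisiting commercial pages monthly would find most of its cached copies stale, which is why revisit schedules matter for freshness.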
The first Web information services were based on traditional information retrieval (IR) algorithms and techniques (a critical summary and review appears elsewhere [4]). However, most IR algorithms were developed for smaller, more coherent collections than the Web.
WEB SEARCHING AND INFORMATION RETRIEVAL

WEB ENGINEERING

The first Web information services were based on traditional information retrieval algorithms, which were originally developed for smaller, more coherent collections than the Web. Due to the Web's continued growth, today's Web searches require new techniques that exploit or extend linkages among Web pages, for example.

Jaroslav Pokorný, Charles University

1521-9615/04/$20.00 © 2004 IEEE
Copublished by the IEEE CS and the AIP
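One family of the link-exploiting techniques the abstract alludes to is link-analysis ranking, of which PageRank is the best-known example. A minimal power-iteration sketch over a hypothetical four-page link graph (the damping factor 0.85 is the conventional choice in the literature, not a value from this article):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration for PageRank over {page: [outgoing links]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:  # dangling page: spread its rank evenly
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank

# Hypothetical link graph: pages a, b, and d all point at c.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # c collects the most inbound rank
```

Rank here depends on the link structure rather than page text alone, which is precisely the extension beyond traditional IR that the abstract describes.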