JULY/AUGUST 2004

Although we can think of the Web as a huge semistructured database that provides us with a vast amount of information, no one knows exactly how many Web pages are out there. Google reports more than 3.3 billion textual documents indexed up to September 2003, but that same month had at least 5.2 billion documents with the word “the” in their Google listings (www.webmasterworld.com/forum3/16779.htm). We can assume that many additional documents and Web pages, perhaps in other languages, do not contain the word “the.”

Most people believe they can easily find the information they’re looking for on the Web. They simply browse from the prelisted entry points in hierarchical directories (like yahoo.com) or start with a list of keywords in a search engine. However, many Web information services deliver inconsistent, inaccurate, incomplete, and often irrelevant results.

For many reasons, existing Web search techniques have significant deficiencies with respect to robustness, flexibility, and precision. For example, although general search engines crawl and index thousands of Web pages (the so-called surface Web), they typically ignore valuable pages that require authorization or prior registration, that is, the ones whose contents are not directly available for crawling through links. This is the hidden (or deep or invisible) Web. Public information on the hidden Web is currently estimated to be 400 to 550 times larger than the surface Web.[1]

Another unpleasant feature of the Web is its volatility. Web documents typically undergo two kinds of change. The first, persistence, is the existence or disappearance of Web pages and sites during a Web document’s life cycle. According to one study,[2] a Web page’s “half-life” seems to be somewhat less than two years, with a Web site’s half-life being somewhat more than two years. The second type of change is page or site content modification.
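The “half-life” figures above admit a simple quantitative reading. As an illustrative sketch only (assuming the exponential survival model that half-life language implies; the function name and numbers are ours, not from the cited study):

```python
def survival_fraction(age_days: float, half_life_days: float) -> float:
    """Expected fraction of documents still alive after age_days,
    under an exponential decay model with the given half-life."""
    return 0.5 ** (age_days / half_life_days)

# Illustrative only: a roughly two-year half-life for Web pages, as cited above.
PAGE_HALF_LIFE_DAYS = 2 * 365

print(round(survival_fraction(365, PAGE_HALF_LIFE_DAYS), 3))  # one year: ~0.707
print(round(survival_fraction(730, PAGE_HALF_LIFE_DAYS), 3))  # two years: 0.5
```

Under this model, roughly 29 percent of the pages indexed today would vanish within a year, which is one reason search engines must continually rediscover and revalidate their indexed content.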
Another study[3] notes that 23 percent of all Web pages change daily (40 percent of commercial Web pages change daily); it also reports a half-life of 10 days for commercial Web pages. Some pages disappear completely, though, which means the data gathered by a search engine can quickly become stale or out of date. Crawlers must regularly revisit Web pages to maintain the freshness of the search engine’s data.

The first Web information services were based on traditional information retrieval (IR) algorithms and techniques (a critical summary and review appears elsewhere[4]). However, most IR algorithms were de-

WEB SEARCHING AND INFORMATION RETRIEVAL

WEB ENGINEERING

The first Web information services were based on traditional information retrieval algorithms, which were originally developed for smaller, more coherent collections than the Web. Due to the Web’s continued growth, today’s Web searches require new techniques, such as exploiting or extending linkages among Web pages.

JAROSLAV POKORNÝ
Charles University

1521-9615/04/$20.00 © 2004 IEEE
Copublished by the IEEE CS and the AIP