New Search Technologies Mine the Web More Deeply - NYTimes.com
http://www.nytimes.com/2009/02/23/technology/internet/23search.html?_r=1&th=&emc=th&pagewanted=print

February 23, 2009

Exploring a ‘Deep Web’ That Google Can’t Grasp

By ALEX WRIGHT

One day last summer, Google’s search engine trundled quietly past a milestone. It added the one trillionth address to the list of Web pages it knows about. But as impossibly big as that number may seem, it represents only a fraction of the entire Web.

Beyond those trillion pages lies an even vaster Web of hidden data: financial information, shopping catalogs, flight schedules, medical research and all kinds of other material stored in databases that remain largely invisible to search engines.

The challenges that the major search engines face in penetrating this so-called Deep Web go a long way toward explaining why they still can’t provide satisfying answers to questions like “What’s the best fare from New York to London next Thursday?” The answers are readily available, if only the search engines knew how to find them.

Now a new breed of technologies is taking shape that will extend the reach of search engines into the Web’s hidden corners. When that happens, it will do more than just improve the quality of search results; it may ultimately reshape the way many companies do business online.

Search engines rely on programs known as crawlers (or spiders) that gather information by following the trails of hyperlinks that tie the Web together.
While that approach works well for the pages that make up the surface Web, these programs have a harder time penetrating databases that are set up to respond to typed queries.

“The crawlable Web is the tip of the iceberg,” says Anand Rajaraman, co-founder of Kosmix (www.kosmix.com), a Deep Web search start-up whose investors include Jeffrey P. Bezos, chief executive of Amazon.com. Kosmix has developed software that matches searches with the databases most likely to yield relevant information, then returns an overview of the topic drawn from multiple sources.

“Most search engines try to help you find a needle in a haystack,” Mr. Rajaraman said, “but what we’re trying to do is help you explore the haystack.”

That haystack is infinitely large. With millions of databases connected to the Web, and endless possible permutations of search terms, there is simply no way for any search engine, no matter how powerful, to sift through every possible combination of data on the fly.
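
To make the crawling limitation described above concrete, here is a minimal sketch in Python. It is an illustration only, not how any real search engine works. The toy site map `SURFACE_LINKS`, the database `FLIGHT_DB`, and the functions `crawl` and `submit_form` are all hypothetical names invented for this example. The crawler follows hyperlinks breadth-first, while the flight results exist only as the answer to a typed query, so no link ever leads to them.

```python
from collections import deque

# Hypothetical toy "surface Web": each page maps to the hyperlinks it contains.
SURFACE_LINKS = {
    "/home": ["/news", "/flights"],
    "/news": ["/home"],
    "/flights": [],  # this page holds only a search form, no outgoing links
}

# Hypothetical database behind the /flights form. A result page exists
# only as the response to a specific typed query.
FLIGHT_DB = {
    ("New York", "London"): "/flights?from=New+York&to=London",
}


def crawl(start):
    """Breadth-first crawl: collect every page reachable by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in SURFACE_LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def submit_form(origin, destination):
    """Simulate typing a query into the form; return the result URL or None."""
    return FLIGHT_DB.get((origin, destination))


crawled = crawl("/home")
result_page = submit_form("New York", "London")

print(sorted(crawled))         # pages the link-following crawler found
print(result_page in crawled)  # whether the query result was among them
```

In this toy setup the crawler reaches the form page itself but never the page of results behind it, which is the gap the Deep Web technologies in this article aim to close.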