DiscoWeb: Applying Link Analysis to Web Search

Brian D. Davison, Apostolos Gerasoulis, Konstantinos Kleisouris, Yingfang Lu, Hyun-ju Seo, Wei Wang, and Baohua Wu
Department of Computer Science, Rutgers University
{davison,gerasoul,kkonst,luyf,hseo,ww,baowu}@cs.rutgers.edu

How often does the search engine of your choice produce results that are less than satisfying, generating endless links to irrelevant pages even though those pages may contain the query keywords? How often are you given pages that tell you things you already know? While search engines and related tools continue to improve their information retrieval algorithms, for the most part they continue to ignore an essential part of the web: the links.

We have found that link analysis can make significant contributions to web page retrieval in search engines, to web community discovery, and to the measurement of web page influence. It can help to rank results and to find high-quality index/hub/link pages that contain links to the best sites on the topic of interest. Our work is based on research from IBM's CLEVER project [7, 4, 6], Stanford's Google [3], and the Web Archaeology research [2, 1] at Compaq's Systems Research Center. These research teams have demonstrated some of the contributions that link analysis can make on the web. In our work, we have attempted to generalize and improve upon their approaches.

Just as in citation analysis of published works, the most influential documents on the web will have many other documents recommending (pointing to) them. This idea underlies all link analysis efforts, from the straightforward technique of counting the number of incoming edges to a page, to the deeper eigenvector analysis used in our work and in the projects mentioned above. It turns out that the identification of "high-quality" web pages reduces to a sparse eigenvalue problem on the adjacency matrix of the linked graph [7, 3].
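The simplest form of this citation-analysis idea, counting incoming links as recommendations, can be sketched in a few lines. The graph below and its page names are a hypothetical toy example, not data from this work:

```python
# Toy in-degree ranking: each page maps to the pages it links to.
# An incoming link is treated as a "recommendation".
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "a"],
}

in_degree = {page: 0 for page in links}
for targets in links.values():
    for target in targets:
        in_degree[target] += 1

# Sort pages by number of recommendations, most-linked first.
ranking = sorted(in_degree, key=in_degree.get, reverse=True)
# "c" has three incoming links, so it ranks first in this toy graph
```

As the text notes, this crude count is only the starting point; the eigenvector methods below refine it by also weighing *who* is doing the recommending.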
In 1998, Kleinberg [7] provided the analysis and substantial evidence that each eigenvector induces a clustering of the web, which he called "web communities". The most important web community corresponds to the principal eigenvector, and the component values within each eigenvector represent a ranking of web pages. Determining the eigenvectors is computationally intensive since the linked graph can be quite large; for example, each keyword search could result in millions of page hits. For this reason, the approach taken in most implementations is to determine only the principal eigenvector [7, 3], using the well-known power iteration method for eigenvectors [5]. If the initial approximation is the unit vector, then the first iteration of the power method corresponds to counting the incoming and/or outgoing edges for each page of the web graph. This approach has certain advantages and disadvantages, which we discuss below.

In Google, the power iteration method is applied off-line over the whole web graph, and the ranking determined by the first eigenvector is then stored in the database. The major advantage of this approach is that there is no additional run-time link analysis penalty during query processing. However, there is an open problem with this approach: the rankings will be dominated by "strong" web pages that are irrelevant to the specific search query. For example, a page such as excite.com will have a much higher ranking because it has a larger number of incoming edges than, say, city.net/countries/greece. So if excite.com is a node in the linked graph, it will be ranked much higher than city.net/countries/greece even when the query is greece. Obviously the node excite.com can be removed afterwards via text analysis, as is done in the current version of Google, but the impact on page rankings needs to be studied further.
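The power iteration described above can be sketched as follows. The adjacency matrix here is a small hypothetical example, not real web data; note how the first multiply from the unit vector reproduces the in-degree counts mentioned in the text, while repeated iterations converge toward the principal eigenvector:

```python
# Power iteration for the principal eigenvector of (the transpose of) a
# hypothetical adjacency matrix A, where A[i][j] = 1 iff page i links to page j.
A = [
    [0, 1, 1, 0],   # page 0 links to pages 1 and 2
    [0, 0, 1, 0],   # page 1 links to page 2
    [1, 0, 0, 0],   # page 2 links to page 0
    [1, 0, 1, 0],   # page 3 links to pages 0 and 2
]
n = len(A)

x = [1.0] * n                      # unit vector as the initial approximation
for _ in range(50):
    # y = A^T x : each page accumulates the scores of the pages linking to it.
    # On the very first pass this is exactly the in-degree count of each page.
    y = [sum(A[i][j] * x[i] for i in range(n)) for j in range(n)]
    norm = max(abs(v) for v in y) or 1.0
    x = [v / norm for v in y]      # max-normalize to keep values bounded

# x now approximates the principal eigenvector; page 2, with the most
# incoming links from well-linked pages, dominates in this toy graph
```

The CLEVER-style refinement discussed below applies the same iteration, but only to the neighborhood graph of a specific query rather than the whole web.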
For example, it is unclear whether the ranking will remain the same if the power method is applied to the subgraph corresponding to the specific query, such as greece, from which all irrelevant pages have been removed first.

Compaq's Connectivity Server and the CLEVER project also compute only the principal eigenvector, but on the linked graph of the neighborhood of pages resulting from a specific query. This substantially reduces the size of the linked graph, and as a result the computation converges much faster. However, unless the linked graph is weighted correctly, the principal eigenvector will not represent the best ranking, as described by Bharat and Henzinger [1]. Even if the linked graph is correctly weighted, the resulting adjacency matrices could be reducible, e.g. the linked
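The query-neighborhood computation used by CLEVER is Kleinberg's hub/authority iteration, which is equivalent to power iteration on A^T A (authorities) and A A^T (hubs). A minimal sketch on a hypothetical neighborhood graph (page names and edges are invented for illustration):

```python
# Kleinberg-style hub/authority iteration on a small hypothetical
# query-neighborhood graph: each page maps to the pages it links to.
graph = {
    "p": ["x", "y"],   # p and q are hub-like pages linking to candidates
    "q": ["x", "y"],
    "r": ["y"],
    "x": [],           # x and y are candidate authority pages
    "y": [],
}
pages = list(graph)

hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(30):
    # A page's authority score sums the hub scores of pages linking to it.
    auth = {p: sum(hub[s] for s in pages if p in graph[s]) for p in pages}
    # A page's hub score sums the authority scores of the pages it links to.
    hub = {p: sum(auth[t] for t in graph[p]) for p in pages}
    # Max-normalize both vectors so the iteration stays bounded.
    a_max = max(auth.values()) or 1.0
    h_max = max(hub.values()) or 1.0
    auth = {p: v / a_max for p, v in auth.items()}
    hub = {p: v / h_max for p, v in hub.items()}

# "y" receives links from all three hubs, so it emerges as the top authority
```

The weighting issue raised by Bharat and Henzinger [1] would enter this sketch as per-edge weights inside the two sums, e.g. discounting many links that originate from a single host.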