Can Social Bookmarking Enhance Search in the Web?

Yusuke Yanbe, Adam Jatowt, Satoshi Nakamura, Katsumi Tanaka
Department of Social Informatics, Kyoto University
Yoshida-Honmachi, Sakyo-ku, 606-8501 Kyoto, Japan
Phone: +81-75-753-5969
{yanbe, adam, nakamura, tanaka}@dl.kuis.kyoto-u.ac.jp

ABSTRACT
Social bookmarking is an emerging type of Web service that helps users share, classify, and discover interesting resources. In this paper, we explore the concept of enhanced search, in which data from social bookmarking systems is exploited to improve search in the Web. We propose combining the widely used link-based ranking metric with one derived from social bookmarking data. First, this increases the precision of a standard link-based search by incorporating popularity estimates from the aggregated data of bookmarking users. Second, it provides an opportunity for extending the search capabilities of existing search engines. Individual contributions of bookmarking users, as well as the general statistics of their activities, are used here for a new kind of complex search in which contextual, temporal, or sentiment-related information is used. We investigate the usefulness of social bookmarking systems for enhancing Web search through a series of experiments on datasets obtained from social bookmarking systems. We then present a prototype system that implements the proposed approach, together with some preliminary results.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Query formulation, Retrieval models, Search process

General Terms
Algorithms, Experimentation, Theory

Keywords
social bookmarking, social search, PageRank, metadata

1. INTRODUCTION
Information retrieval (IR) has the objective of obtaining relevant documents from document collections given queries provided by users.
Traditionally, the vector space model based on the popular TF*IDF measure [19] was used for finding relevant documents. This approach works well in finite and controlled environments such as document collections. However, in huge and uncontrolled environments such as the Web, a simple content-based retrieval method like the vector space model is impractical. In the Web, large quantities of resources constantly compete for the attention of users. Many of them are indeed relevant to user queries (which are usually very short and often ambiguous). However, the sheer size of the Web does not allow the entire set of related documents to be presented to users. In such an environment the quality of documents starts to play an important role, yet measuring this quality based solely on page content is difficult and may also be subjective.

The link structure of the Web provides a better means of estimating page quality. PageRank [18] is the most famous method that uses link structure analysis. The idea behind the PageRank algorithm is to exploit the macro-scale link structure between pages in order to capture the popularity of documents and, indirectly, their quality. According to this approach, the popularity of a page is determined on the basis of the size of a hypothetical user stream coming to the page. However, link-based algorithms currently have many disadvantages [14]: for example, they are vulnerable to spamming, it is often difficult for average users to create links, and links may have a variety of meanings and purposes. Therefore, despite the previous success of link-based search algorithms, their current limitations mean that new, better approaches need to be sought.

With the advent of Web 2.0, social bookmarking systems seem to have the potential to improve the search capabilities of current search engines. In these systems, the popularity of a Web page is calculated as the total number of times it has been bookmarked, that is, as the number of users voting for the page.
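As a rough illustration of this bookmark-count popularity, the following Python sketch counts, for each URL, the number of distinct users who bookmarked it and orders pages by that count. The function names, the (user, url) input format, and the sample data are hypothetical and not taken from any particular bookmarking service; counting each user at most once per page is an assumption based on the "number of users voting" reading of the definition.

```python
from collections import defaultdict


def bookmark_counts(bookmarks):
    """Return {url: number of distinct users who bookmarked it}.

    `bookmarks` is an iterable of (user, url) pairs (hypothetical format).
    A user who bookmarks the same page twice is counted once (assumption).
    """
    users_by_url = defaultdict(set)
    for user, url in bookmarks:
        users_by_url[url].add(user)
    return {url: len(users) for url, users in users_by_url.items()}


def rank_by_bookmarks(bookmarks):
    """Return (url, count) pairs, most bookmarked first; ties broken by URL."""
    counts = bookmark_counts(bookmarks)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample data for illustration only.
sample = [
    ("alice", "http://example.org/a"),
    ("bob", "http://example.org/a"),
    ("alice", "http://example.org/b"),
    ("alice", "http://example.org/a"),  # repeated bookmark, counted once
]

for url, count in rank_by_bookmarks(sample):
    print(url, count)
```

Using a set per URL keeps the count tied to voters rather than to raw bookmark events, which is one plausible way to make the measure less sensitive to repeated submissions by the same user.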
We call this measure SBRank. There are many differences between PageRank and SBRank that result from their characteristics. SBRank captures the popularity of resources among content consumers (page readers), while PageRank is in general a result of author-to-author evaluation of Web resources. This means that users who are not capable of creating and managing Web pages can also give "votes" to pages by creating social bookmarks. This situation is probably one of the causes of the different temporal characteristics of the two metrics.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
JCDL'07, June 17–22, 2007, Vancouver, British Columbia, Canada.
Copyright 2007 ACM 978-1-59593-644-8/07/0006...$5.00.

Generally, SBRank is more dynamic than PageRank, and it often takes only a short time for pages to reach their