Searching for Interestingness in Wikipedia and Yahoo! Answers

Yelena Mejova (1), Ilaria Bordino (2), Mounia Lalmas (3), Aristides Gionis (4)
(1,2,3) Yahoo! Research Barcelona, Spain; (4) Aalto University, Finland
{ymejova, bordino, mounia}@yahoo-inc.com, aristides.gionis@aalto.fi

ABSTRACT
In many cases, when browsing the Web, users are searching for specific information. Sometimes, though, users are also looking for something interesting, surprising, or entertaining. Serendipitous search puts interestingness on par with relevance. We investigate how interesting the results obtained via serendipitous search are, and what makes them so, by comparing entity networks extracted from two prominent social media sites, Wikipedia and Yahoo! Answers.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

Keywords
Serendipity, Exploratory search

1. INTRODUCTION
Serendipitous search occurs when a user with no a priori or totally unrelated intentions interacts with a system and acquires useful information [4]. A system supporting such exploratory capabilities must provide results that are relevant to the user's current interest, and yet interesting, to encourage the user to continue the exploration.

In this work, we describe an entity-driven exploratory and serendipitous search system, based on enriched entity networks that are explored through random-walk computations to retrieve search results for a given query entity. We extract entity networks from two datasets: Wikipedia, a curated, collaborative online encyclopedia, and Yahoo! Answers, a more unconstrained question-answering forum, where the freedom of conversation may offer advantages such as opinions, rumors, and signals of social interest and approval.

We compare the networks extracted from the two media by performing user studies in which we juxtapose the interestingness of the results retrieved for a query entity with their relevance.
We investigate whether interestingness depends on (i) the curated/uncurated nature of the dataset, and/or on (ii) additional characteristics of the results, such as sentiment, content quality, and popularity.

WWW 2013 Companion, May 13–17, 2013, Rio de Janeiro, Brazil.
Copyright 2013 ACM 978-1-4503-2038-2/13/05.

2. ENTITY NETWORKS
We extract entity networks from (i) a dump of the English Wikipedia from December 2011, consisting of 3,795,865 articles, and (ii) a sample of the English Yahoo! Answers dataset from 2010/2011, containing 67,336,144 questions and 261,770,047 answers. We use state-of-the-art methods [3, 5] to extract entities from the documents in each dataset. Next, we draw an arc between any two entities e1 and e2 that co-occur in one or more documents. We assign the arc a weight w1(e1, e2) = DF(e1, e2), equal to the number of such documents (the document frequency (DF) of the entity pair). This weighting scheme tends to favor popular entities. To mitigate this effect, we measure the rarity of any entity e in a dataset by computing its inverse document frequency IDF(e) = log(N) - log(DF(e)), where N is the size of the collection and DF(e) is the document frequency of entity e. We set a threshold on IDF to drop the arcs that involve the most popular entities. We also rescale the arc weights according to the alternative scheme w2(e1 → e2) = DF(e1, e2) · IDF(e2).

We use Personalized PageRank (PPR) [1] to extract the top n entities related to a query entity. We consider two scoring methods.
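The arc-weighting schemes described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the input representation (each document as a set of entities) and the `idf_threshold` default are assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def build_entity_network(docs, idf_threshold=1.0):
    """Build weighted entity arcs from documents, where each document
    is given as the set of entities it mentions.

    Returns (w1, w2):
      w1(e1, e2) = DF(e1, e2), the number of documents containing both;
      w2(e1 -> e2) = DF(e1, e2) * IDF(e2), a directed rescaling.
    `idf_threshold` is a hypothetical cutoff for dropping arcs that
    involve the most popular (low-IDF) entities.
    """
    n = len(docs)
    df = Counter()       # DF(e): number of documents containing entity e
    pair_df = Counter()  # DF(e1, e2): number of documents containing both
    for entities in docs:
        df.update(entities)
        for e1, e2 in combinations(sorted(entities), 2):
            pair_df[(e1, e2)] += 1

    # IDF(e) = log(N) - log(DF(e))
    idf = {e: math.log(n) - math.log(df[e]) for e in df}

    w1, w2 = {}, {}
    for (e1, e2), d in pair_df.items():
        # Drop arcs touching very popular entities (IDF below threshold).
        if idf[e1] < idf_threshold or idf[e2] < idf_threshold:
            continue
        w1[(e1, e2)] = d
        w2[(e1, e2)] = d * idf[e2]  # w2(e1 -> e2)
        w2[(e2, e1)] = d * idf[e1]  # w2(e2 -> e1)
    return w1, w2
```

Note that w2 is directed: the weight of the arc into e2 is scaled by the rarity of e2, so arcs pointing at rare entities are boosted.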
When using the w2 weighting scheme, we simply use the PPR scores (we dub this method IDF). When using the simpler scheme w1, we normalize the PPR scores by the global PageRank scores (with no personalization) to penalize popular entities. We dub this method PN.

We enrich our entity networks with metadata regarding the sentiment and quality of the documents. Using SentiStrength (sentistrength.wlv.ac.uk), we extract sentiment scores for each document. We calculate attitude and sentimentality metrics [2] to measure the polarity and strength of the sentiment. Regarding quality, for Yahoo! Answers documents we count the number of points assigned by the system to the users, as an indication of expertise and thus good quality. For Wikipedia, we count the number of dispute messages inserted by editors to request revisions, as an indication of bad quality. We derive sentiment and quality scores for any entity by averaging over all the documents in which the entity appears. We use Wikimedia statistics (dumps.wikimedia.org/other/pagecounts-raw) to estimate the popularity of entities.

3. EXPLORATORY SEARCH
We test our system using a set of 37 queries originating from the 2010 and 2011 Google Zeitgeist (www.google.com/zeitgeist) and having sufficient coverage in both datasets. Using one of the two algorithms (PN or IDF), we retrieve the top five entities from each dataset (YA or WP) for each
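The two scoring methods can be sketched as below, using NetworkX as a stand-in graph library. The graph, entity names, and parameter defaults are illustrative assumptions, not the system's actual implementation.

```python
import networkx as nx

def top_related(G, query_entity, method="PN", n=5):
    """Rank entities related to `query_entity` via Personalized PageRank.

    method="IDF": raw PPR scores, intended for a w2-weighted graph.
    method="PN":  PPR scores on a w1-weighted graph, divided by the
                  global (non-personalized) PageRank scores to
                  penalize globally popular entities.
    """
    # Personalized PageRank: restarts concentrated on the query entity.
    ppr = nx.pagerank(G, personalization={query_entity: 1.0}, weight="weight")
    if method == "PN":
        global_pr = nx.pagerank(G, weight="weight")  # no personalization
        scores = {e: ppr[e] / global_pr[e] for e in G}
    else:
        scores = ppr
    scores.pop(query_entity, None)  # exclude the query entity itself
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

For example, on a toy weighted graph, `top_related(G, "obama", method="PN", n=5)` returns the five entities whose popularity-normalized PPR score is highest, which is how the top-five result lists per dataset could be produced.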