Yokie - A Curated, Real-time Search & Discovery System using Twitter

Owen Phelan, Kevin McCarthy and Barry Smyth
CLARITY Centre for Sensor Web Technologies
School of Computer Science & Informatics
University College Dublin
Email: firstname.lastname@ucd.ie

ABSTRACT

Social networks and the Real-time Web (RTW) have joined Search and Discovery as central pillars of online human activities. These are staple venues of interaction, with vast social graphs facilitating messaging and sharing of information. Twitter¹, for example, boasts 200 million users posting over 150 million messages every day. Such volumes of content being disseminated make for a tempting source of relevant content on the web. In this paper, we present Yokie, a novel search and discovery system that sources its index from the shared URLs of a curated selection of Twitter users. The added benefit of this method is that tweets containing these URLs carry extra contextual information, such as terms describing the URL and its publishing time, as well as tweet metadata, which can include location and user data. Also, since we are exploiting a social graph structure of content sharing, it is possible to explore novel reputation ranking of content. The mixture of contextual data with the fundamental harnessing of sharing activities amongst a curated set of users combines to produce a novel system that, in an initial online user trial, has shown promising results.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Algorithms, Experimentation, Theory

Keywords
Search, Discovery, Information Retrieval, Relevance, Reputation, Twitter

¹ Twitter - http://www.twitter.com

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.

            Tweet count   Tweet count (with URL)   %
Sample 1          54221                    11964   22.065251
Sample 2        1411784                   331445   23.47703
Sample 3        6924205                  1539323   22.231043
Sample 4        7453870                  1647295   22.09986
Sample 5       60042573                 13113525   21.840378
Average:                                           22.468298
Std. Dev.:                                         0.67627121

Figure 1: Analysis of 5 public Twitter datasets of varying sizes consisting of public tweets, with the percentage of tweets containing URLs. The datasets were gathered at various points between 2009 and 2011. Sets 1, 2 and 3 were focussed scrapes, specific to a set of hashtags. Sets 4 and 5 were general public scrapes of the Twitter firehose.

1. INTRODUCTION

Google², Bing³ and Yahoo!⁴ are household tools for finding relevant items on the web, of varying quality and relevance to the user's search query or task. These systems rely on automatic software "crawlers" that build queryable indexes by navigating the web of documents. These crawlers index documents based on their content, find the edges between documents (hyperlinks), and perform a set of weighting and relevance calculations to decide on the hubs and authorities of the web, while improving index quality [3].

More recently, search systems have started to introduce context into their ranking and retrieval strategies, such as location and time of document publication. These are mostly content-based (related to the document's actual content), as it is difficult for a web crawler to determine the precise contextual features of a web document.

Social networks are an abundant resource of social activity and discussion. In the case of Twitter, we estimate that, on average, 22% of tweets contain a hyperlink to a document (analysis shown in Figure 1).

² Google - http://www.google.com
³ Microsoft Bing - http://www.bing.com
⁴ Yahoo! - http://www.yahoo.com
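
The percentages in Figure 1 can be recomputed directly from the tweet counts. The Python sketch below is illustrative only: the names `SAMPLES` and `url_share` are hypothetical and not part of Yokie, and the use of the sample standard deviation (n - 1 denominator) is an assumption. Under that assumption, the average and standard deviation reported in Figure 1 match Samples 1-4; including Sample 5 gives a mean of roughly 22.34%, which still rounds to the 22% quoted in the Introduction.

```python
from statistics import mean, stdev

# (tweet count, tweet count with URL) for Samples 1-5, copied from Figure 1.
SAMPLES = [
    (54221, 11964),
    (1411784, 331445),
    (6924205, 1539323),
    (7453870, 1647295),
    (60042573, 13113525),
]


def url_share(total: int, with_url: int) -> float:
    """Return the percentage of tweets that contain a URL."""
    return 100.0 * with_url / total


percentages = [url_share(total, with_url) for total, with_url in SAMPLES]

# Summary over the first four samples.
# Assumption: sample standard deviation (n - 1 denominator).
mean_first_four = mean(percentages[:4])
stdev_first_four = stdev(percentages[:4])

# Mean over all five samples, for comparison.
mean_all = mean(percentages)

if __name__ == "__main__":
    for i, pct in enumerate(percentages, start=1):
        print(f"Sample {i}: {pct:.6f}%")
    print(f"Mean (Samples 1-4): {mean_first_four:.6f}")
    print(f"Std. dev. (Samples 1-4): {stdev_first_four:.6f}")
    print(f"Mean (Samples 1-5): {mean_all:.6f}")
```

Keeping the raw counts rather than the rounded percentages makes it easy to re-derive the summary when a dataset is added or removed.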