Incremental Cosine Computations for Search and Exploration of Tag Spaces

Raymond Vermaas, Damir Vandic, and Flavius Frasincar

Erasmus University Rotterdam
PO Box 1738, NL-3000 DR, Rotterdam, the Netherlands
info@raymondvermaas.nl, {vandic, frasincar}@ese.eur.nl

Abstract. Tags are often used to describe user-generated content on the Web. However, the available Web applications do not process new tag information incrementally, which negatively influences their scalability. Since the cosine similarity between tags represented as co-occurrence vectors is an important aspect of these frameworks, we propose two approaches for an incremental computation of cosine similarities. The first approach recalculates the cosine similarity for new tag pairs and for existing tag pairs whose co-occurrences have changed. The second approach computes the cosine similarity between two tags by reusing, if available, the previous cosine similarity between these tags. Both approaches compute the same cosine values that would have been obtained by a complete recalculation of the cosine similarities. The performed experiments show that our proposed approaches are between 1.2 and 23 times faster than a complete recalculation, depending on the number of co-occurrence changes and new tags.

1 Introduction

User-based content is becoming increasingly available on the Web. This content is often annotated using tags and then uploaded to social sites, like the photo sharing service Flickr. Because users can choose any tag they like, there is a large amount of unstructured tag data available on the Web. The unstructured nature of these tags makes it hard to find content using current search methods, which are based on lexical matching. For example, if a user searches for “Apple”, (s)he could be looking for the fruit or for the company that makes the iPod.

There are several approaches available that aim to solve the previously identified problem [2, 4, 9–11].
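As a minimal illustration of the quantity named in the abstract, the following sketch builds co-occurrence vectors from tagged items and computes the cosine similarity between two tags. The function names, the sparse dictionary representation, and the sample data are illustrative assumptions, not the authors' implementation, and the sketch performs a complete (non-incremental) computation.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def cooccurrence_vectors(tagged_items):
    """Map each tag to a sparse vector {other_tag: co-occurrence count}."""
    vectors = defaultdict(lambda: defaultdict(int))
    for tags in tagged_items:
        # Count every unordered pair of distinct tags once per item.
        for a, b in combinations(sorted(set(tags)), 2):
            vectors[a][b] += 1
            vectors[b][a] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse co-occurrence vectors."""
    dot = sum(count * v[tag] for tag, count in u.items() if tag in v)
    norm_u = sqrt(sum(count * count for count in u.values()))
    norm_v = sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a tag without co-occurrences is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical sample data: each inner list holds the tags of one item.
items = [
    ["apple", "fruit", "pie"],
    ["apple", "fruit"],
    ["apple", "ipod", "music"],
    ["ipod", "music"],
]
vectors = cooccurrence_vectors(items)
similarity = cosine(vectors["fruit"], vectors["pie"])
```

In this toy data, "fruit" and "pie" share more of their co-occurrence context than "fruit" and "ipod", so the first pair receives the higher cosine value.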
In this paper, we focus on the Semantic Tag Clustering Search (STCS) framework [4, 9, 11]. The STCS framework utilizes two types of clustering techniques that allow for easier search and exploration of tag spaces. First, syntactic clustering is performed by using a graph clustering algorithm that employs the Levenshtein distance measure in order to compute the dissimilarity between tags. As a result of syntactic clustering, terms like “waterfal”, “waterfall”, and “waterfalls” are clustered together. This means that when a user searches for one of these terms, all the terms that are syntactically associated with it will show up in the results. Second, semantic tag clustering is performed, where the aim is