On Tag Spell Checking

Franco Maria Nardini², Fabrizio Silvestri², Hossein Vahabi¹,², Pedram Vahabi⁴, and Ophir Frieder³

¹ IMT, Lucca, Italy
² ISTI-CNR, Pisa, Italy
³ Department of Computer Science, Georgetown University, Washington DC, USA
⁴ University of Modena and Reggio Emilia, Modena, Italy

Abstract. Exploiting the cumulative behavior of users is a common technique used to improve many popular online services. We build a tag spell checker using a graph-based model. In particular, we present a novel technique based on the graph of tags associated with objects made available by online sites such as Flickr and YouTube. Experiments on real-world data demonstrate the effectiveness of our approach: we achieve a precision of up to 93% with a recall (i.e., the fraction of errors detected) of up to 100%.

1 Introduction

Unlike query spell checking, tag spelling correction aims to ensure that the tagged object can actually be retrieved. Correcting “hip hop” to “hip-hop”, when the latter is more frequent than the former, is a good way to allow people to find the resource when querying for the concept “hip-hop”¹. A user who tags a resource wants that resource to be easily found. When querying, a user formulates a sentence-like text to retrieve the desired concept and to satisfy her/his information need. With tags, on the other hand, users leave “breadcrumbs” for others to detect. Like “breadcrumbs”, tags do not have any particular inter-relationship apart from the fact that they were left by the same user.

We exploit the collective knowledge [1,2] of users to build a spell checking system on tags. The main challenge is to enable tag spell checkers to manage sets of terms (with their relative co-occurrence patterns) instead of strings of terms, namely, queries. Much previous work is devoted to query spell checking.
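The idea of handling tags as sets with co-occurrence patterns, rather than as ordered strings, can be illustrated with a minimal sketch. This is not the method of the paper, which is not specified at this point in the text; the function name `cooccurrence_counts` and the tag data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(resources):
    """Count how often each unordered pair of tags appears on the same resource."""
    counts = Counter()
    for tags in resources:
        # Tags are treated as a set: order and duplicates carry no meaning.
        # Sorting gives each pair a single canonical key.
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical tag sets for three resources (illustrative data only).
resources = [
    {"hip-hop", "music", "rap"},
    {"hip-hop", "music", "concert"},
    {"hip hop", "music", "rap"},
]

counts = cooccurrence_counts(resources)
print(counts[("hip-hop", "music")])  # co-occurrences of the frequent form
print(counts[("hip hop", "music")])  # co-occurrences of the rarer variant
```

In this toy data the rarer variant “hip hop” shares its neighbours (“music”, “rap”) with the more frequent “hip-hop”, which is the kind of signal a co-occurrence-based checker could draw on. How such counts are actually weighted or used is an assumption left open here.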
Unlike queries, which are short strings of two or three terms, tags form sets of about ten terms per resource. We exploit this relatively high number of tags per resource to provide correct spellings for tags. Indeed, our method exploits the correlation between tags associated with the same resource.

¹ The hidden assumption we make is that people formulate queries for resources by following the same mental process as people tagging resources.

E. Chavez and S. Lonardi (Eds.): SPIRE 2010, LNCS 6393, pp. 37–42, 2010. © Springer-Verlag Berlin Heidelberg 2010

We are able to detect and correct