Hyperincident Connected Components of Tagging Networks

Nicolas Neubauer & Klaus Obermayer
Neural Information Processing Group
Technische Universität Berlin, Franklinstr. 28/29, Berlin, Germany
neubauer|oby@cs.tu-berlin.de

ABSTRACT
Data created by social bookmarking systems can be described as 3-partite 3-uniform hypergraphs connecting documents, users, and tags (tagging networks), such that the toolbox of complex network analysis can be applied to examine their properties. One of the most basic tools, the analysis of connected components, cannot however be applied meaningfully: tagging networks tend to be almost entirely connected. We therefore propose a generalization of connected components, m-hyperincident connected components. We show that decomposing tagging networks into 2-hyperincident connected components yields a characteristic component distribution with a salient giant component that can be found across various datasets. This pattern changes if the underlying formation process changes, for example, if the hypergraph is constructed from search logs, or if the tagging data is contaminated by spam: it turns out that the second- to 129th-largest components of the spam-labeled Bibsonomy dataset are inhabited exclusively by spam users. Based on these findings, we propose an unsupervised method for spam detection.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval; I.2.6 [Artificial Intelligence]: Learning; G.2.2 [Graph Theory]

General Terms
Algorithms, Experimentation

1. INTRODUCTION
Tagging systems allow users to organize resources by annotating them with tags. The unified set of resources and assigned tags across all users of such a system has been extensively studied; complex patterns have been shown to emerge from these individual acts of information management (see, e.g., [6] for one of the earliest works in this field).
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HT’09, June 29–July 1, 2009, Torino, Italy. Copyright 2009 ACM 978-1-60558-486-7/09/06 ...$5.00.

By interpreting each assignment of a tag t to a document d by a user u as an edge (d, u, t) of a hypergraph that we will call a “tagging network”, the toolbox of complex network analysis can be applied to the analysis of the emerging structures. For example, tagging behaviour was shown in [2] to create characteristic power-law-like degree distributions, with major deviations from these patterns caused by spam entries. In this work, we examine the networks’ connectivity features and find striking differences between spam-free and spammed networks.

Distinguishing artificial bookmarking/tagging behaviour (such as spamming) from genuine human information management activities can help sharpen our understanding of the underlying cognitive and social processes; it has, however, also become a practical task. Social bookmarking sites now receive significant attention from both users and search engines, creating incentives for spammers to penetrate these systems. When targeting users, spammers tag fake sites with popular tags, trying to trick users into visiting the posted site when they browse the entries for a given tag. Search engines can be targeted by tagging the promoted website with a random tag. Social bookmarking sites show a list of top entries for each tag, and holding the top (and probably only) position in such a list for any tag might lead search engines to boost a page’s ranking.
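To make the hypergraph view concrete, the following minimal Python sketch stores a tagging network as a list of (d, u, t) triples and groups its nodes into ordinary connected components with a union-find structure. The function name `connected_components`, the node prefixes, and the toy data are illustrative assumptions and are not taken from the paper's implementation; the sketch covers only plain connectivity (two nodes are connected if a chain of shared hyperedges links them), not the generalization proposed in this work.

```python
from collections import defaultdict


def connected_components(edges):
    """Return the node sets of the ordinary connected components of a
    tagging network, largest first.

    `edges` is an iterable of (document, user, tag) triples. All three
    nodes of a hyperedge end up in the same component.
    """
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk to the root
        # while shortening the path (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for d, u, t in edges:
        # Prefix each label with its partition so that, e.g., a user and
        # a tag with the same name remain distinct nodes.
        nd, nu, nt = ("doc", d), ("user", u), ("tag", t)
        union(nd, nu)
        union(nd, nt)

    components = defaultdict(set)
    for node in list(parent):
        components[find(node)].add(node)
    return sorted(components.values(), key=len, reverse=True)


# Toy data (hypothetical): three linked assignments and one isolated one.
example_edges = [
    ("d1", "alice", "python"),
    ("d2", "alice", "graphs"),
    ("d3", "bob", "python"),
    ("d4", "carol", "cheap-pills"),
]
components = connected_components(example_edges)
sizes = [len(c) for c in components]
```

In this toy network, the first three assignments are linked through the shared user "alice" and the shared tag "python", while the fourth assignment shares no node with the others and forms a component of its own.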
Figure 1, rendered with GraphViz [4], visualizes the top entries of the Bibsonomy social bookmarking dataset (containing manually classified spam and non-spam entries, described in detail below), first without spam, then with spam included. We see subtle patterns in the clean data being overshadowed by the spam entries in the second plot. These two plots not only provide an example of spam in social bookmarking systems, but also motivate the basic assumption underlying our further work: spammers not only post different websites with different tags; they behave so fundamentally differently that the structure of the resulting networks changes.

Analyzing the distribution of connected components, and in particular the existence and relative size of a so-called “giant component”, i.e., a single connected subgraph containing the largest number of nodes, is a common technique in complex network analysis and can yield valuable insights into the underlying formation dynamics. Decomposing tagging networks into their connected components, however, turns out to be uninformative: as we will show later on, they tend to consist of a single connected component containing more than 99.9% of all nodes. Facing this high degree of connectedness, we propose a generalization