Hyperincident Connected Components of Tagging Networks

Nicolas Neubauer & Klaus Obermayer
Neural Information Processing Group
Technische Universität Berlin, Franklinstr. 28/29, Berlin, Germany
neubauer|oby@cs.tu-berlin.de

ABSTRACT
Data created by social bookmarking systems can be described as 3-partite 3-uniform hypergraphs connecting documents, users, and tags (tagging networks), so that the toolbox of complex network analysis can be applied to examine their properties. One of the most basic tools, the analysis of connected components, cannot however be applied meaningfully: tagging networks tend to be almost entirely connected. We therefore propose a generalization of connected components, m-hyperincident connected components. We show that decomposing tagging networks into 2-hyperincident connected components yields a characteristic component distribution with a salient giant component that can be found across various datasets. This pattern changes if the underlying formation process changes, for example, if the hypergraph is constructed from search logs, or if the tagging data is contaminated by spam: it turns out that the second- to 129th-largest components of the spam-labeled Bibsonomy dataset are inhabited exclusively by spam users. Based on these findings, we propose an unsupervised method for spam detection.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval; I.2.6 [Artificial Intelligence]: Learning; G.2.2 [Graph Theory]:

General Terms
Algorithms, Experimentation

1. INTRODUCTION
Tagging systems allow users to organize resources by annotating them with tags. The unified set of resources and assigned tags across all users of such a system has been extensively studied: complex patterns have been shown to emerge from these individual acts of information management (see, e.g., [6] for one of the earliest works in this field).
HT’09, June 29–July 1, 2009, Torino, Italy. Copyright 2009 ACM 978-1-60558-486-7/09/06 ...$5.00.

If each assignment of a tag t to a document d by a user u is interpreted as an edge (d, u, t) of a hypergraph that we will call a “tagging network”, the toolbox of complex network analysis can be applied to the emerging structures. For example, tagging behaviour was shown in [2] to create characteristic power-law-like degree distributions, with major deviations from these patterns caused by spam entries. In this work, we examine the networks’ connectivity features and find striking differences between spam-free and spammed networks.

Distinguishing artificial bookmarking/tagging behaviour (such as spamming) from genuine human information management activities can help sharpen our understanding of the underlying cognitive and social processes; it has, however, also become a practical task. Social bookmarking sites now receive significant attention from both users and search engines, creating incentives for spammers to penetrate these systems. When targeting users, spammers tag fake sites with popular tags, trying to trick users into visiting the posted site when they browse the entries for a given tag. Search engines can be targeted by tagging the promoted website with a random tag: social bookmarking sites show a list of top entries for each tag, and a page holding the top (and probably only) position in such a list for some tag might lead search engines to boost that page’s ranking.
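The hypergraph view introduced above can be made concrete with a small sketch. Assuming tag assignments are available as (document, user, tag) triples, the following Python code groups the nodes of such a network into ordinary connected components using a union-find structure and returns the component sizes. The function name `component_sizes`, the toy data, and the choice of union-find are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def component_sizes(assignments):
    """Return the node counts of the ordinary connected components,
    largest first.

    `assignments` is an iterable of (document, user, tag) triples; each
    triple is one hyperedge. Nodes are prefixed with their type so that,
    e.g., a tag and a user with the same name stay distinct.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:  # path compression
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # A hyperedge connects all three of its nodes.
    for d, u, t in assignments:
        doc, user, tag = ("doc", d), ("user", u), ("tag", t)
        union(doc, user)
        union(user, tag)

    sizes = Counter(find(node) for node in list(parent))
    return sorted(sizes.values(), reverse=True)


# Hypothetical toy data, for illustration only.
assignments = [
    ("d1", "alice", "python"),
    ("d2", "alice", "graphs"),
    ("d2", "bob", "networks"),
    ("d3", "carol", "recipes"),
]

sizes = component_sizes(assignments)
giant_share = sizes[0] / sum(sizes)
print(sizes, giant_share)
```

On this toy input the largest component holds 7 of the 10 nodes. The share of nodes in the largest component is the quantity of interest when looking for a giant component.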
Figure 1, created with GraphViz [4], visualizes the top entries of the Bibsonomy social bookmarking dataset (containing manually classified spam and non-spam entries, described in detail below), first without spam, then with spam included. We see subtle patterns in the clean data being overshadowed by the spam entries in the second plot. These two plots not only provide an example of spam in social bookmarking systems, but also motivate the basic assumption underlying our further work: spammers not only post different websites with different tags; they behave so fundamentally differently that the structure of the resulting networks changes.

Examining the distribution of connected components, and in particular the existence and relative size of a so-called “giant component”, i.e., a single connected subgraph containing the largest number of nodes, is a common technique in the analysis of complex networks and can yield valuable insights into the underlying formation dynamics. Decomposing tagging networks into their connected components, however, turns out to be uninformative: as we will show later on, they tend to consist of a single connected component containing more than 99.9% of all nodes. Facing this high degree of connectedness, we propose a generalization