Detecting Highly Overlapping Community Structure by Greedy Clique Expansion

Conrad Lee, Fergal Reid, Aaron McDaid, Neil Hurley
University College Dublin
Clique Research Cluster
Dublin 4, Ireland
{conradlee,fergal.reid,aaronmcdaid}@gmail.com, neil.hurley@ucd.ie

ABSTRACT

In complex networks it is common for each node to belong to several communities, implying a highly overlapping community structure. Recent advances in benchmarking indicate that the existing community assignment algorithms that are capable of detecting overlapping communities perform well only when the extent of community overlap is kept to modest levels. To overcome this limitation, we introduce a new community assignment algorithm called Greedy Clique Expansion (GCE). The algorithm identifies distinct cliques as seeds and expands these seeds by greedily optimizing a local fitness function. We perform extensive benchmarks on synthetic data to demonstrate that GCE’s good performance is robust across diverse graph topologies. Significantly, GCE is the only algorithm to perform well on these synthetic graphs, in which every node belongs to multiple communities. Furthermore, when put to the task of identifying functional modules in protein interaction data, and college dorm assignments in Facebook friendship data, we find that GCE performs competitively.

Categories and Subject Descriptors: H.2.8 Database Management: Database Applications – Data Mining

Keywords: Community Assignment, Overlapping, Local Clustering Algorithm, Complex Networks

1. INTRODUCTION

Community structure has been recognized in networks that come from a wide range of domains, such as social and biological networks. While concrete definitions of community vary by domain, a community may generally be described as a set of nodes with dense internal connections, exhibiting comparatively sparse connections to the rest of the network.
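As a rough illustration of the seed-and-expand idea described in the abstract, and of the informal definition of a community as internally dense and externally sparse, the following Python sketch grows a clique seed by greedily adding whichever neighbouring node most improves a local fitness score. The fitness used here (the fraction of a node set's edge endpoints that stay inside the set) is an assumption chosen for simplicity, not necessarily the function GCE optimizes, and the names `build_adjacency`, `fitness`, and `greedy_expand` are hypothetical. Seeds are supplied by hand rather than found by clique enumeration.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def fitness(adj, community):
    """Illustrative local fitness (an assumption, not the paper's definition):
    the share of edge endpoints of `community` that point back into it."""
    k_in = 0   # endpoints landing inside the set (each internal edge counted twice)
    k_out = 0  # endpoints landing outside the set
    for u in community:
        for v in adj[u]:
            if v in community:
                k_in += 1
            else:
                k_out += 1
    total = k_in + k_out
    return k_in / total if total else 0.0


def greedy_expand(adj, seed):
    """Grow `seed` one node at a time, always taking the neighbour that gives
    the largest fitness gain; stop when no neighbour improves the fitness."""
    community = set(seed)
    while True:
        frontier = set().union(*(adj[u] for u in community)) - community
        best_node, best_fit = None, fitness(adj, community)
        for v in sorted(frontier):  # sorted for deterministic tie-breaking
            candidate_fit = fitness(adj, community | {v})
            if candidate_fit > best_fit:
                best_node, best_fit = v, candidate_fit
        if best_node is None:
            return community
        community.add(best_node)


# Toy graph: two 4-cliques, {0,1,2,3} and {3,4,5,6}, that share node 3.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
adj = build_adjacency(edges)

# Hand-picked triangle seeds (hypothetical; a real run would enumerate cliques).
a = greedy_expand(adj, {0, 1, 2})
b = greedy_expand(adj, {4, 5, 6})
print(sorted(a))      # [0, 1, 2, 3]
print(sorted(b))      # [3, 4, 5, 6]
print(sorted(a & b))  # [3]
```

Because each seed is expanded independently, node 3 ends up in both communities. That is the kind of overlap a partitioning method cannot express, since it must place node 3 in exactly one group.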
Knowledge of community structure can reveal functional organization in networks, much as identifying organs in the body can reveal the role of various tissues. In recent years, numerous community assignment algorithms (CAAs) have been suggested, as computer scientists and physicists have taken on the problem of algorithmically finding communities (for an excellent recent review of the field, see Fortunato [1]).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. The 4th SNA-KDD Workshop ’10 (SNA-KDD’10) July 25, 2010, Washington, DC USA. Copyright 2010 ACM 978-1-4503-0225-8 ...$10.00.

Despite their proliferation, it is difficult to determine the performance of CAAs for two reasons. On the one hand, there is a lack of large empirical datasets where the a priori or ground truth communities are known. On the other hand, most synthetic data (especially the most popular, the GN model [2]) is overly simplistic and unrealistic: it lacks key topological features such as a heterogeneous degree distribution, varied community sizes, and triadic closure, and it requires that every node belong to exactly one community. The lack of realistic benchmark graphs has led to a situation where researchers know that many algorithms perform well on simple networks, but do not know how they perform on more complex empirical data. This problem is so pronounced that in his comprehensive review of the field, Fortunato states with regard to benchmarking:

“...the issue of testing algorithms has received very little attention in the literature on graph clustering. This is a serious limit of the field.
Because of that, it is still impossible to state which method (or subset of methods) is the most reliable in applications...”

In the last year, Lancichinetti and Fortunato [3] have addressed this uncertainty by specifying a means of creating more realistic synthetic benchmark graphs, which have scale-free degree and community size distributions as well as overlapping communities. Using their specification (called LFR), they and others have subsequently discovered, with a level of subtlety previously unattained, under what topological conditions a wide range of CAAs perform well or poorly [4, 5].

One surprising result revealed by this recent benchmarking is the poor performance of many CAAs when it comes to detecting moderately overlapping community structure. It is intuitive from our knowledge of real world domains that many complex networks will have communities that overlap, potentially to a high degree. Consider, for example, a social network site like Facebook. On average, a Facebook user has 130 “friends,” who typically belong to multiple distinct social groups [6]. These groups may correspond to ties formed in high-school, college, professional settings, and family. Figure 1, which depicts the ego-centric network of a Facebook user, demonstrates this tendency for a user to belong to multiple groups. The analysis of Marlow et al. [7] suggests that the groups apparent in this user’s ego-centric network correspond to acquaintances formed at different stages of life, and that most of these groups are dormant. Clearly, if this type of ego-centric network is typical of Facebook users, then any CAA that partitions nodes into non-overlapping communities (henceforth, non-overlapping CAA) will perform poorly: such CAAs can assign each node to only one of its many communities.
Similarly, in complex networks of interactions between proteins, it has been claimed that many proteins belong to multiple communities, each of which in turn corresponds to some biological function [8, 9].

arXiv:1002.1827v2 [physics.data-an] 15 Jun 2010

Since 2005, the year in which Palla et al.