Detecting Communities Around Seed Nodes in Complex Networks

Christian L. Staudt, Yassine Marrakchi, Henning Meyerhenke
Department of Informatics, Karlsruhe Institute of Technology (KIT)
christian.staudt@kit.edu, yassine.marrakchi@student.kit.edu, meyerhenke@kit.edu

Abstract: The detection of communities (internally dense subgraphs) is a network analysis task with manifold applications. The special task of selective community detection is concerned with finding high-quality communities locally around seed nodes. Given the lack of conclusive experimental studies, we perform a systematic comparison of different previously published as well as novel methods. In particular, we evaluate their performance on large complex networks, such as social networks. Algorithms are compared with respect to accuracy in detecting ground truth communities, community quality measures, size of communities and running time. We implement a generic greedy algorithm which subsumes several previous efforts in the field. Experimental evaluation of multiple objective functions and optimizations shows that the frequently proposed greedy approach is not adequate for large datasets. As a more scalable alternative, we propose selSCAN, our adaptation of a global, density-based community detection algorithm. In a novel combination with algebraic distances on graphs, query times can be strongly reduced through preprocessing. However, selSCAN is very sensitive to the choice of numeric parameters, limiting its practicality. The random-walk-based PageRankNibble emerges from the comparison as the most successful candidate.

I. INTRODUCTION

Problem Motivation and Definition: Uncovering the community structure of real-world networks regularly yields interesting empirical insights about complex systems, and also has promising applications in the electronic networks that surround us.
At the same time, community detection is a challenging graph mining and algorithm engineering problem, partly because of the imprecise concept of community structure in networks. Most formalizations (e.g. modularity, conductance) lead to NP-hard optimization problems, but efficient heuristics have been created and successfully applied, and continue to be developed.

In this work we consider the problem of quickly finding high-quality communities around given seed nodes, particularly in large complex networks. We refer to this task as selective community detection (SCD, also called local community detection or seed set expansion) to distinguish it from global community detection. In the global scenario, the graph is considered as a whole and an assignment of each node to a community is sought. In contrast, selective community detection takes a small set of seed nodes as input and finds appropriate communities containing them by searching the network locally around the seed nodes. Such a targeted approach offers potential for speedup when solving the problem globally is unnecessary or infeasible. Some SCD algorithms also enable us to find query-specific communities, assuming that the best community for seed node $s_1$ is different from the one for a nearby seed node $s_2$. SCD methods are especially applicable when we lack global knowledge of the network, e.g. in scenarios where the network structure is discovered on-the-fly. While it may seem that the difference between the global and selective scenario lies only in the amount of work performed, we show that the methods needed are somewhat different. Applications of SCD are as numerous as those of global community detection, and include e.g. finding functional complexes for a given protein in protein interaction networks [27].

The existing literature defines the task in slightly different ways, leading to a diverse set of SCD algorithms.
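To make the idea of searching locally around a seed concrete, the following is a minimal illustrative sketch, not one of the algorithms evaluated in this paper. It assumes an undirected, unweighted graph given as an adjacency dictionary and uses the common definition of conductance, the number of cut edges divided by the smaller of the two volumes. The names `conductance`, `expand_seed` and `adj` are hypothetical and chosen only for this example.

```python
def conductance(adj, community, total_volume):
    """Cut edges leaving `community`, divided by the smaller side's volume."""
    community = set(community)
    cut = sum(1 for u in community for v in adj[u] if v not in community)
    vol_in = sum(len(adj[u]) for u in community)
    denom = min(vol_in, total_volume - vol_in)
    # Convention for this sketch: a degenerate set gets the worst score.
    return cut / denom if denom > 0 else 1.0


def expand_seed(adj, seed):
    """Grow a node set around `seed` while conductance strictly improves.

    Only the current community and its direct neighbours are inspected in
    each step. The total volume is computed once up front for simplicity,
    which does require one pass over the whole graph.
    """
    total_volume = sum(len(neighbours) for neighbours in adj.values())
    community = {seed}
    best = conductance(adj, community, total_volume)
    while True:
        frontier = {v for u in community for v in adj[u]} - community
        if not frontier:
            break
        # Score every frontier node by the conductance after adding it.
        score, node = min(
            (conductance(adj, community | {v}, total_volume), v)
            for v in sorted(frontier)
        )
        if score >= best:
            break  # no frontier node improves the community
        community.add(node)
        best = score
    return community


# Toy graph (assumed for illustration): two triangles joined by edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
print(expand_seed(adj, 0))
print(expand_seed(adj, 5))
```

On this toy graph, each seed should recover its own triangle, and the two seeds yield different communities. This also shows how the result of a selective method depends on the query node.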
We therefore begin by specifying the task as follows: Given a graph $G = (V, E)$ and a set of seed nodes $S \subseteq V$, return an assignment of each seed $s \in S$ to a community $C \subseteq V$ so that $s$ is contained in $C$. Within this definition, the following aspects need clarification:
• We discuss in Sec. II when a subset of nodes is considered a good community.
• The output is an assignment of seed nodes to communities. Depending on the algorithm, the communities may overlap or coincide.
• We do not require the seed nodes to be structurally close to each other; the considered algorithms can handle both related and unrelated seeds appropriately.

Methods: While a variety of methods have been proposed, we observe a lack of conclusive experimental studies demonstrating their practical relevance. With our comparative study, we aim to close this gap. We are also the first to target large complex networks on the order of $10^5$ to $10^6$ edges, requiring scalable algorithms and implementations. After a review of previous work on the subject (Sec. III), algorithmic approaches are compared and classified. We identify one widely used approach which we call Greedy Community Expansion (GCE, Sec. V-A). With our generic implementation of GCE, we evaluate existing objective functions and optimizations. In our experimental comparison we include a reimplementation of the random-walk-based PageRankNibble [1], representing an important class of approaches to the problem. Furthermore, as our main algorithmic innovation, we adapt a global algorithm to the selective scenario (Sec. V-B): A modification of the density-based SCAN algorithm yields our variant selSCAN. The algorithm is generic with respect to a node distance measure, and we propose