LinkSCAN*: Overlapping Community Detection Using the Link-Space Transformation

Sungsu Lim†1, Seungwoo Ryu‡2, Sejeong Kwon§3, Kyomin Jung¶4, Jae-Gil Lee†5∗

† Department of Knowledge Service Engineering, KAIST
‡ Samsung Advanced Institute of Technology, Samsung Electronics
§ Graduate School of Culture Technology, KAIST
¶ Department of Electrical and Computer Engineering, Seoul National University

1,3,5 {ssungssu, gsj1029, jaegil}@kaist.ac.kr, 2 seungwoo.ryu@samsung.com, 4 kjung@snu.ac.kr

∗ Jae-Gil Lee is the corresponding author.

Abstract—In this paper, we propose a novel framework for overlapping community detection, the link-space transformation, which transforms a given original graph into a link-space graph. Its key idea is to consider topological structure and link similarity separately using two distinct types of graphs: the line graph and the original graph. For topological structure, each link of the original graph is mapped to a node of the link-space graph, which enables us to discover overlapping communities using non-overlapping community detection algorithms, as in the line graph. Link similarity is calculated on the original graph and carried over into the link-space graph, which enables us to keep the original structure in the transformed graph. Thus, by combining these two advantages, our transformation both facilitates overlapping community detection and improves the quality of the result. Based on this framework, we develop the algorithm LinkSCAN, which performs structural clustering on the link-space graph. Moreover, we propose the algorithm LinkSCAN*, which enhances the efficiency of LinkSCAN by sampling. Extensive experiments were conducted using the LFR benchmark networks as well as some real-world networks. The results show that our algorithms achieve higher accuracy, quality, and coverage than the state-of-the-art algorithms.

I. INTRODUCTION

A. Motivation

In many real-world social networks such as Facebook and Twitter, individuals can belong to multiple communities, e.g., family, friends, colleagues, and schoolmates [1]. Thus, there have been active discussions on overlapping community detection [1], [2], [3], [4], [5], [6]. The existing methods can be roughly classified into two categories depending on the graph element used for community discovery.

• Node-based: Each node is directly associated with multiple communities. A popular method is labeling a node x with a set of pairs (c, b), where c is a community identifier and b is a belonging coefficient [6]. A belonging coefficient indicates the strength of x's membership in the community c. Well-known methods in this category include the label propagation method [6].

• Structure-based: Community discovery is done through a pre-defined structure such as a clique or a link. A node can participate in multiple cliques or links. Thus, even if cliques or links are partitioned into disjoint communities, participating nodes can belong to multiple communities. Well-known methods in this category include the Clique Percolation Method (CPM) [1], [3] and the link-partition method [2], [4].

In connection with this categorization, we have found that the existing methods suffer from three common problems.

1) Many highly overlapping nodes: The node-based category assigns a set of belonging coefficients to a node, where the sum of the coefficients is 1. Thus, as more communities overlap at a node, the value 1 needs to be distributed among more communities, resulting in smaller differences among the coefficients. If the coefficients become close to one another, it is difficult to distinguish the communities to which the node belongs from those to which it does not.

2) Incorrect base-structures: The structure-based category first tries to discover base-structures from a graph.
This procedure is based on the assumption that the graph consists of many such base-structures. However, in the CPM, the graph may not have many cliques since cliques are very strict structures. Thus, for loosely connected graphs, many of the nodes are not covered by any community in the CPM. The link-partition method does not suffer from this problem since the link is the basic structure of a graph.

3) Incorrect membership of weak ties: In the structure-based category, it is usually required that every base-structure belong to at least one community. This requirement is problematic since it may result in overly overlapping communities. For example, a weak tie [7] should not be included in any community since it does not represent a strong relationship between the nodes. In Figure 1, suppose that Jack and Bob are travel buddies and do not have common friends in a social network. Jack belongs to his workplace community, and Bob to his family community. If the weak tie is assigned to one of the communities, the two communities overlap excessively at either Jack or Bob. In fact, the two communities should clearly be separated.

978-1-4799-2555-1/14/$31.00 2014 IEEE ICDE Conference 2014

Table I compares the popular community detection algorithms with respect to these three problems. An 'X' mark indicates that