MOHCS: Towards Mining Overlapping Highly Connected Subgraphs*

Xiahong Lin
School of Computer Science and Technology, Xidian University, Xi’an, China
xhlin@sjtu.edu.cn

Lin Gao†
School of Computer Science and Technology, Xidian University, Xi’an, China
lgao@mail.xidian.edu.cn

Kefei Chen
Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, China
kfchen@sjtu.edu.cn

David K. Y. Chiu
Department of Computing and Information Science, University of Guelph, Guelph, N1G 2W1, Canada
dchiu@cis.uoguelph.ca

Abstract: Many real-life networks contain parts in which the nodes are more highly connected to each other than to the other nodes of the network. Such collections of nodes are usually called clusters, communities, cohesive groups, or modules. In graph terminology, such a part is called a highly connected graph. In this paper, we first prove some properties of highly connected graphs. Based on these properties, we then redefine the highly connected subgraph, which yields an algorithm that determines in linear time whether a given graph is highly connected. We then present a computationally efficient algorithm, called MOHCS, for mining overlapping highly connected subgraphs. We have evaluated the performance of MOHCS experimentally on synthetic and real data sets, namely computer-generated graphs and a yeast protein network. Our results show that MOHCS is effective and reliable in finding overlapping highly connected subgraphs.

Keywords: highly connected subgraph, clustering algorithms, minimum cut, minimum degree

I. INTRODUCTION

In a graph modeling a network, such as a biological network [1], an information network [2], or a social network [3], a highly connected subgraph corresponds to a meaningful, cohesive set of interconnected vertices. For example, a dense co-expression network may represent a tight co-expression cluster [4]. The definition of a highly connected graph varies from work to work.
We define a highly connected graph (or simply dense graph) as a graph whose minimum cut size is no less than half the number of its vertices (the formal definition can be found in [5]). Because highly connected subgraphs have such wide application, identifying these a priori unknown building blocks is crucial to understanding the structural and functional properties of networks.

Researchers have addressed various problem settings and proposed numerous algorithms. Our review focuses only on the algorithms most closely related to our work. Among these, [6] provides a definition of highly connected subgraph that is valid and useful in practice. The HCS algorithm introduced there is one of the best-known clustering algorithms and has been widely used in domains such as gene expression analysis [7] and functional module discovery [5, 8-10]. It recursively partitions the current graph into two subgraphs by removing a minimum cut until the graph is highly connected [5].

However, HCS has some shortcomings. First, HCS cannot identify overlapping highly connected subgraphs, because it is by nature a graph-partitioning method [5]. Second, when applied repeatedly to a large, sparse graph, HCS often cuts off a single vertex in each iteration, which gives a time complexity of $O(V^2 E + V^3 \log V)$ [5]. Third, the minimum cut computation is a critical step of HCS, and when HCS is applied to a graph whose number of edges is close to quadratic in the number of vertices, even the fastest deterministic minimum cut algorithm [11] has time complexity $O(V^3)$. In [5], Hu et al. proposed an algorithm called MODES, which combines HCS with normalized cut, and designed a procedure to identify overlapping highly connected subgraphs.
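To make the recursive partitioning concrete, the following is a minimal sketch of the highly-connected test and an HCS-style recursion, under the definition stated above (minimum cut size no less than half the number of vertices). The function names (`from_edges`, `stoer_wagner`, `is_highly_connected`, `hcs`) are hypothetical, and the Stoer-Wagner style minimum cut is a choice made for this illustration; it is not necessarily the routine of [11], nor the implementation used in [5] or [6]. Singleton vertices are simply discarded here.

```python
def from_edges(edges):
    """Build an unweighted, undirected adjacency map {u: {v: 1}}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, {})[v] = 1
        adj.setdefault(v, {})[u] = 1
    return adj


def stoer_wagner(adj):
    """Return (cut_size, side) for a global minimum cut of the graph."""
    g = {u: dict(nbrs) for u, nbrs in adj.items()}   # working copy
    members = {u: {u} for u in g}                    # merged super-vertices
    best_size, best_side = float("inf"), set()
    while len(g) > 1:
        # One minimum-cut phase: repeatedly add the vertex that is most
        # tightly connected to the vertices added so far.
        start = next(iter(g))
        weights = {v: g[start].get(v, 0) for v in g if v != start}
        order = [start]
        while weights:
            nxt = max(weights, key=weights.get)
            cut_of_phase = weights.pop(nxt)
            order.append(nxt)
            for v, w in g[nxt].items():
                if v in weights:
                    weights[v] += w
        s, t = order[-2], order[-1]
        if cut_of_phase < best_size:
            best_size, best_side = cut_of_phase, set(members[t])
        # Merge the last vertex t into s.
        for v, w in g[t].items():
            if v != s:
                g[s][v] = g[s].get(v, 0) + w
                g[v][s] = g[s][v]
            del g[v][t]
        members[s] |= members[t]
        del g[t], members[t]
    return best_size, best_side


def induced(adj, nodes):
    """Subgraph induced by the vertex set `nodes`."""
    return {u: {v: w for v, w in adj[u].items() if v in nodes} for u in nodes}


def is_highly_connected(adj):
    """Minimum cut size is no less than half the number of vertices."""
    n = len(adj)
    return n >= 2 and stoer_wagner(adj)[0] >= n / 2


def hcs(adj):
    """Split along a minimum cut until every part is highly connected."""
    n = len(adj)
    if n < 2:
        return []                      # singletons are dropped in this sketch
    cut_size, side = stoer_wagner(adj)
    if cut_size >= n / 2:
        return [set(adj)]
    rest = set(adj) - side
    return hcs(induced(adj, side)) + hcs(induced(adj, rest))


def k4_edges(offset):
    """Edges of a complete graph on vertices offset .. offset + 3."""
    return [(offset + i, offset + j) for i in range(4) for j in range(i + 1, 4)]


# Two 4-cliques joined by the single edge (3, 4).
graph = from_edges(k4_edges(0) + k4_edges(4) + [(3, 4)])
clusters = hcs(graph)
print(sorted(sorted(c) for c in clusters))  # prints [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Because each recursive call splits the vertex set into two disjoint parts, the clusters returned by such a procedure can never share a vertex, which illustrates the first shortcoming listed above.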
Furthermore, to mine highly connected subgraphs more effectively, several authors have introduced greedy vertex deletion algorithms, based on the observation that low-degree vertices can intuitively be disregarded when producing a highly connected subgraph. For example, Asahiro et al. [12] proposed the following greedy algorithm to find a $k$-vertex subgraph with the maximum weight: repeatedly remove a vertex with the minimum weighted degree in the currently remaining graph, until exactly $k$ vertices are left. However, these greedy algorithms cannot be applied directly to our problem because of differences in the definition and in other problem settings, which we explain below.

Our motivation is to find a more efficient algorithm for mining overlapping highly connected subgraphs by reconsidering the properties of highly connected subgraphs. The contributions of our work are as follows:

• We give several properties and consider them in a new definition of highly connected subgraph.

* Supported by National Natural Science Foundation of China (No. 60574039) and the Project Sponsored by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.
† To whom correspondence should be addressed. E-mail: lgao@mail.xidian.edu.cn