Effective and Efficient Community Search over Large Heterogeneous Information Networks

Yixiang Fang, Yixing Yang, Wenjie Zhang, Xuemin Lin, Xin Cao
University of New South Wales, Australia
yixiang.fang@unsw.edu.au, yixing.yang@unsw.edu.au, wenjie.zhang@unsw.edu.au, lxue@cse.unsw.edu.au, xin.cao@unsw.edu.au

ABSTRACT

Recently, the topic of community search (CS) has attracted considerable attention. Given a query vertex, CS looks for a dense subgraph that contains it. Existing studies mainly focus on homogeneous graphs, in which all vertices are of the same type, and cannot be directly applied to heterogeneous information networks (HINs), which consist of multi-typed, interconnected objects, such as bibliographic networks and knowledge graphs. In this paper, we study the problem of community search over large HINs; that is, given a query vertex q, find a community from an HIN containing q, in which all the vertices have the same type as q and are closely related. To model the relationship between two vertices of the same type, we adopt the well-known concept of meta-path, which is a sequence of relations defined between different types of vertices. We then measure the cohesiveness of the community by extending the classic minimum degree metric with a meta-path. We further propose efficient query algorithms for finding communities using these cohesiveness metrics. We have performed extensive experiments on five large real HINs, and the results show that the proposed solutions are effective for searching communities. Moreover, they are much faster than the baseline solutions.

PVLDB Reference Format:
Yixiang Fang, Yixing Yang, Wenjie Zhang, Xuemin Lin, Xin Cao. Effective and Efficient Community Search over Large Heterogeneous Information Networks. PVLDB, 13(6): 854-867, 2020.
DOI: https://doi.org/10.14778/3380750.3380756

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 13, No. 6
ISSN 2150-8097.

1. INTRODUCTION

Heterogeneous information networks (HINs) are networks with multiple types of objects and multiple types of links denoting different semantic relations. These graph data sources are prevalent in various domains, including bibliographic information networks, social media, and knowledge graphs. Figure 1(a) illustrates an HIN of the DBLP network, which describes the relationships among entities of different types (i.e., author, paper, venue, and topic). Specifically, it consists of six authors (i.e., a1, ..., a6), six papers (i.e., p1, ..., p6), one venue (i.e., v1), and two topics (i.e., t1 and t2). The directed edges denote their semantic relationships. For example, the authors a1 and a2 have written a paper p1, which mentions the topic t1 and is published in the venue v1.

Figure 1: An example HIN with the DBLP network schema. (a) An HIN. (b) Schema, over the vertex types A, P, T, and V with the relations write, mention, and pubIn.

In this paper, we study the problem of Community Search over HINs (or the CSH problem). To the best of our knowledge, this paper is the first work on CS over HINs. Given an HIN G and a query vertex q ∈ G, our goal is to find a community, or a set of vertices, from G containing q, in which all the vertices have the same type as q and are closely related. In particular, the community satisfies meta-path-based cohesiveness (i.e., its vertices are intensively connected by instances of a specific meta-path). The concept of meta-path has been extensively studied [62]; it is a sequence of vertex types and edge types between two given vertex types. In Figure 2(a), a meta-path P1, defined on authors (A) and papers (P), describes two authors with co-authorship. In Figure 1(a), the authors {a1, a2, a3, a4} form a cohesive community, in which each pair of authors can be connected by a path instance of P1.

Table 1: Works on network community retrieval.

Network Type     Community Detection (CD)     Community Search (CS)
Homogeneous      [2, 14, 31, 39, 40, 49]      [15, 16, 26, 35, 37, 60], [9, 23, 24, 36, 41, 42]
Heterogeneous    [59, 61, 63–65, 81]          CSH (This work)

Prior works. Existing works on network community retrieval can be roughly classified into community detection (CD) and community search (CS). Some representative works are summarized in Table 1. Generally, CD algorithms aim to identify all the communities of a graph [31, 49, 63–65, 81]. These studies are not "query-based", i.e., they are not customized for a query request (e.g., a user-specified query vertex). Moreover, for large graphs, they often take a long time to detect all the communities, so they are not suitable for online queries. To address these issues, query-based CS approaches (e.g., [15, 26, 35, 60]) have recently been studied. However, previous CS approaches mainly focus on homogeneous networks, where all the vertices are of the same type, whereas in HINs both vertices and edges are of multiple types and carry different semantic meanings, so it does not make sense to mix them up when performing CS with existing approaches.

CSH problem. In this paper, we focus on searching communities in HINs, in which vertices are with a specific type (e.g.,