PPKWS: An Efficient Framework for Keyword Search on Public-Private Networks

Jiaxin Jiang, Xin Huang, Byron Choi, Jianliang Xu, Sourav S Bhowmick, Lyu Xu
Department of Computer Science, Hong Kong Baptist University, Hong Kong
School of Computer Science and Engineering, Nanyang Technological University, Singapore
{jxjian,xinhuang,bchoi,xujl,cslyuxu}@comp.hkbu.edu.hk, assourav@ntu.edu.sg

Abstract: Because graphs such as knowledge graphs, social networks and RDF graphs are unstructured and lack schemas, keyword search has been proposed for querying them. In many applications (e.g., social networks), a user may prefer to hide part or all of his/her data graph (e.g., private friendships) from the public. This leads to a recent graph model, namely the public-private network model, in which each user has his/her own network. While there have been studies on public-private network analysis, keyword search on public-private networks has not yet been studied. For example, query answers on a private network alone and on a combination of private and public networks can be different. In this paper, we propose a new keyword search framework, called public-private keyword search (PPKWS). PPKWS consists of three major steps: partial evaluation, answer refinement, and answer completion. Since many keyword search semantics exist, we select three representative ones and show that they can be implemented on the model with minor modifications. We propose indexes and optimizations for PPKWS. We have verified through experiments that, on average, the algorithms implemented on top of PPKWS run 113 times faster than the original algorithms run directly on the public network attached to the private network, when retrieving answers that span both networks.

I. INTRODUCTION

Knowledge graphs, social networks and RDF graphs have a wide variety of emerging applications, including semantic query processing [24], information summarization [21], community search [9], collaboration and activities organization [20] and user-friendly query facilities [22]. Such graphs often lack useful schema information for users to formulate their queries. Keyword search is a fundamental query paradigm that makes querying such data easy. In a nutshell, a user specifies a set of keywords Q on a data graph G as his/her query. Depending on the search semantics, the answers to Q are subgraphs that contain the keywords, possibly restricted to the top-k such subgraphs. For instance, Google's Knowledge Graph Search API (https://developers.google.com/knowledge-graph/) helps users find answers in its knowledge base, and returns the query answers in the form of subtrees. The answers (a) make it easy for users to explore additional relevant keywords and (b) indicate the relationships between the query keywords.

[Fig. 1: An example of the public-private graph model (G is a public graph, and G1, G2, G3 and G4 are private graphs). The figure shows the public graph G, the private graphs of Alice (G1), Dave (G2), Carol (G3) and Bob (G4), the portal nodes, and the answers to Q = {DB, AI, CV} in G4 (no answer), in G, and in the combined graph of Bob.]

As reported in a recent study [7], users may have private graphs such as private knowledge bases or social networks. For instance, 52.6% of 1.4 million New York City Facebook users hide their friends lists. Such behavior naturally leads to a new graph model, called the public-private graph model [1], [3], [17]. It consists of a public graph and many private graphs, where each private graph is accessible only to its owner. Generally, each user has his/her own combined graph, formed from the public graph and his/her private graph.
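To make the model concrete, the following is a minimal sketch of how one user's combined graph could be formed. It assumes undirected graphs stored as adjacency dictionaries and treats every node shared by the public and private graphs as a connecting node; the function names `combine` and `shared_nodes` and the toy node identifiers are hypothetical and are not taken from the paper.

```python
def combine(public_adj, private_adj):
    """Return a new adjacency dict holding the union of both edge sets.

    Neither input is modified (illustrative helper, not the paper's code).
    """
    combined = {u: set(nbrs) for u, nbrs in public_adj.items()}
    for u, nbrs in private_adj.items():
        combined.setdefault(u, set()).update(nbrs)
    return combined


def shared_nodes(public_adj, private_adj):
    """Nodes present in both graphs; these are where the two graphs meet."""
    return set(public_adj) & set(private_adj)


# Toy data (assumed): public path a-b-c, private path c-x-y.
public_g = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
private_g = {"c": {"x"}, "x": {"c", "y"}, "y": {"x"}}

combined_g = combine(public_g, private_g)
print(sorted(shared_nodes(public_g, private_g)))
print(sorted(combined_g["c"]))
```

Materializing such a union for every user is exactly the cost that becomes prohibitive when the public graph is large, which is why the sketch is only meant to illustrate the model, not a practical evaluation strategy.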
This model warrants revisiting the research on keyword search for two reasons. Firstly, the combined graphs can be large. For instance, the latest version of one semantic knowledge base, YAGO, contains 4.5 million entities and 24 million facts. It is not practical to directly apply the existing indexing techniques (e.g., [14] and [10]) to the combined graph of each user. Secondly, there are already several semantics for keyword search. It is desirable to have a unified framework that optimizes their query performance.

Example I.1. Consider a public collaboration network G in Fig. 1 (e.g., [11]), where each node is an academic whose labels represent keywords of his/her research interests, and each edge is a collaboration on research papers. A professor, Bob, has a private collaboration network G4 as shown in Fig. 1 (e.g., for grants, conferences and company organizations). G and G4 are visible to Bob. G1, G2 and G3 are not, since they are owned by “Alice”, “Dave” and “Carol”, respectively. G and G4 are connected through some common nodes (a.k.a. portal nodes, shown as concentric circles in Fig. 1). When Bob proposes a new interdisciplinary project “DB-AI-CV”, he first seeks out his close collaborators (say, within 2 hops) in his private network G4. The query {“DB”,“AI”,“CV”} on Bob’s private network returns “No answer”. The answer from the public graph G alone is a subtree rooted at “Bob” whose leaf vertices are {“Dave”,“Carol”}, but they are not close to each other. From the combined network of G4 and G, Bob obtains a subtree rooted at “Bob” whose leaf vertices are {“Alice”,“Carol”}, which is a closer collaboration.
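The effect described in Example I.1 can be reproduced on a small toy instance. The sketch below is an illustration only: the edges, the intermediate nodes `u1`, `u2`, `v1`, `v2`, and the label assignment are assumptions chosen to mimic the narrative, not the actual graph of Fig. 1. The search semantics (nearest match per keyword from a fixed root, with a hop bound) is also a simplification and not necessarily one of the semantics studied in the paper.

```python
from collections import deque


def undirected(edges):
    """Build an undirected adjacency dict from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_hops(adj, source):
    """Hop distance from source to every node reachable in adj."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def nearest_matches(adj, labels, root, query):
    """For each keyword, the closest (hops, node) carrying it, or None."""
    dist = bfs_hops(adj, root)
    result = {}
    for kw in query:
        hits = [(d, n) for n, d in dist.items() if kw in labels.get(n, ())]
        result[kw] = min(hits) if hits else None
    return result


def answer_within(adj, labels, root, query, max_hops):
    """Matches for all keywords if each lies within max_hops, else None."""
    matches = nearest_matches(adj, labels, root, query)
    if any(m is None or m[0] > max_hops for m in matches.values()):
        return None
    return matches


# Assumed toy data, loosely following the story of Example I.1.
public_edges = [("Bob", "u1"), ("u1", "u2"), ("u2", "Dave"),
                ("Bob", "v1"), ("v1", "v2"), ("v2", "Carol"),
                ("Carol", "Alice")]
private_edges = [("Bob", "Alice")]  # visible to Bob only
labels = {"Bob": {"DB"}, "Alice": {"AI"}, "Dave": {"AI"}, "Carol": {"CV"}}
query = ["DB", "AI", "CV"]

public_g = undirected(public_edges)
private_g = undirected(private_edges)
combined_g = undirected(public_edges + private_edges)

print(answer_within(private_g, labels, "Bob", query, 2))
print(nearest_matches(public_g, labels, "Bob", query))
print(answer_within(combined_g, labels, "Bob", query, 2))
```

On this assumed instance, the private graph alone has no node labelled CV, the public graph matches AI and CV only at 3 hops (via Dave and Carol), and the combined graph matches them within 2 hops (via Alice and Carol), mirroring the three outcomes in the example.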