Efficient Subgraph Search over Large Uncertain Graphs

Ye Yuan§, Guoren Wang§, Haixun Wang, Lei Chen§

College of Information Science and Engineering, Northeastern University, China
Microsoft Research Asia
Hong Kong University of Science and Technology, Hong Kong, China
State Key Lab of Software Engineering, Wuhan University, Wuhan, Hubei, China
{yuanye,wanggr}@ise.neu.edu.cn, haixunw@microsoft.com, leichen@cse.ust.hk

ABSTRACT

Retrieving graphs containing a query graph from a large graph database is a key task in many graph-based applications, including chemical compound discovery, protein complex prediction, and structural pattern recognition. However, graph data handled by these applications is often noisy, incomplete, and inaccurate because of the way the data is produced. In this paper, we study subgraph queries over uncertain graphs. Specifically, we consider the problem of answering threshold-based probabilistic queries over a large uncertain graph database with the possible world semantics. We prove that this problem is #P-complete; therefore, we adopt a filtering-and-verification strategy to speed up the search. In the filtering phase, we use a probabilistic inverted index, PIndex, based on subgraph features obtained by an optimal feature selection process. During the verification phase, we develop exact and bound algorithms to validate the remaining candidates. Extensive experimental results demonstrate the effectiveness of the proposed algorithms.

1. INTRODUCTION

In this paper, we study the problem of subgraph matching over large uncertain graphs. A large variety of applications work on graph-structured data, and in many cases, the graph data they deal with are uncertain or noisy by nature [1, 4, 5, 8, 14, 20, 21, 24, 31].
For example, in bioinformatics, protein-protein interaction (PPI) networks obtained through experiments are noisy: they may contain interactions that do not really exist, and at the same time they may miss real interactions [8, 26, 31]. It is thus more natural to represent a PPI network as an uncertain graph where nodes (proteins) are connected by uncertain edges associated with numerical values that indicate the possibility of interaction between the proteins. As another example, in visual pattern recognition, graphs are used to model visual objects, and since information is incomplete or noisy, such representations are uncertain [4, 21, 24]. It has been shown that methods finding probabilistic matches outperform exact matching algorithms [34] in many aspects. Uncertainty also arises in social networks: a link between two persons is often associated with a probability that represents the uncertainty of the link [20] or the strength of influence a person has over another person in virtual marketing [12].

[Figure 1: Uncertain graph database & query graphs. Panels: uncertain graphs (001) and (002); query graphs (q1) and (q2).]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 37th International Conference on Very Large Data Bases, August 29th - September 3rd 2011, Seattle, Washington. Proceedings of the VLDB Endowment, Vol. 4, No. 11. Copyright 2011 VLDB Endowment 2150-8097/11/08... $10.00.
In the aforementioned applications, graph matching is a typical query for many interesting tasks, such as identifying scenes (graphs) in visual pattern recognition [4, 21], predicting complex biological interactions (graphs) [8, 31], and finding social communities (graphs) [12]. Therefore, it is important to study subgraph matching over large uncertain graphs.

1.1 Probabilistic Subgraph Matching

In this paper, we focus on threshold-based probabilistic subgraph matching (T-PS) over a large set of uncertain graphs. Specifically, let D = {g₁, g₂, ..., gₙ} be a set of uncertain graphs, let q be a query graph, and let ϵ be a probability threshold. A T-PS query retrieves all graphs g ∈ D such that the subgraph isomorphic probability (SIP) between q and g is at least ϵ. We will formally define SIP in Section 2.

Example 1. Figure 1 shows a database that contains two uncertain graphs (001 and 002) and two query graphs (q₁ and q₂). Vertices and edges are labeled (A, B, C, ...; a, b, c, ...), and a real number associated with each vertex and each edge represents the existence probability of that vertex or edge.

The first question we must answer is: what constitutes a match in uncertain graphs? To answer this question, we employ the possible world semantics [11, 30], which has been used for modeling query processing over probabilistic databases. A possible world graph (PWG) of an uncertain graph is a possible instance of that graph. It contains a subset of the vertices and edges of the uncertain graph, and it has a weight which is the product of the probabilities of all the vertices and edges it has. Then, for a query graph q and an uncertain graph g, the probability that q matches g is the sum of the weights of those PWGs of g to which q is subgraph-isomorphic.

Example 2. Figure 1.1 lists all the PWGs of uncertain graph 001 and their weights. Altogether there are 18 PWGs for graph 001, and the sum of all the weights is 1.
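The possible world semantics and the T-PS query described above can be illustrated with a brute-force sketch. This is not the method proposed in this paper (which relies on filtering and verification), and it rests on several assumptions: the toy graph `G`, its probabilities, and the query `Q` are illustrative values rather than values read from Figure 1; an edge is assumed to exist only when both of its endpoints exist; and, so that the weights sum to 1, each weight also multiplies in (1 - p) for every absent vertex and every absent edge whose endpoints are present. The function names `possible_worlds`, `contains`, `sip`, and `t_ps` are hypothetical.

```python
from itertools import combinations, permutations, product


def possible_worlds(vprob, eprob):
    """Yield (vertex_set, edge_list, weight) for each possible world graph."""
    vertices = list(vprob)
    for r in range(len(vertices) + 1):
        for kept in combinations(vertices, r):
            kept = set(kept)
            w_vertices = 1.0
            for v in vertices:
                w_vertices *= vprob[v] if v in kept else 1.0 - vprob[v]
            # Assumption: an edge can only appear when both endpoints exist.
            live = [e for e in eprob if e[0] in kept and e[1] in kept]
            for mask in product((False, True), repeat=len(live)):
                weight, edges = w_vertices, []
                for e, present in zip(live, mask):
                    weight *= eprob[e] if present else 1.0 - eprob[e]
                    if present:
                        edges.append(e)
                yield kept, edges, weight


def contains(world_vertices, world_edges, g, q):
    """Brute-force labeled subgraph test; only sensible for tiny graphs."""
    edge_label = {frozenset(e): g["elabel"][e] for e in world_edges}
    q_nodes = list(q["vlabel"])
    for image in permutations(sorted(world_vertices), len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if any(g["vlabel"][m[x]] != q["vlabel"][x] for x in q_nodes):
            continue
        if all(edge_label.get(frozenset((m[a], m[b]))) == lab
               for (a, b), lab in q["elabel"].items()):
            return True
    return False


def sip(g, q):
    """Sum the weights of the possible worlds of g that contain q."""
    return sum(w for vs, es, w in possible_worlds(g["vprob"], g["eprob"])
               if contains(vs, es, g, q))


def t_ps(database, q, eps):
    """Return the ids of graphs whose SIP with q is at least eps."""
    return [gid for gid, g in database.items() if sip(g, q) >= eps]


# Illustrative triangle (hypothetical labels and probabilities).
G = {
    "vlabel": {1: "A", 2: "A", 3: "B"},
    "vprob": {1: 0.6, 2: 0.8, 3: 0.9},
    "elabel": {(1, 2): "a", (1, 3): "b", (2, 3): "b"},
    "eprob": {(1, 2): 0.9, (1, 3): 0.7, (2, 3): 0.5},
}
# Illustrative query: an A vertex joined to a B vertex by a 'b' edge.
Q = {"vlabel": {"x": "A", "y": "B"}, "elabel": {("x", "y"): "b"}}

if __name__ == "__main__":
    worlds = list(possible_worlds(G["vprob"], G["eprob"]))
    print(len(worlds), sum(w for _, _, w in worlds))
    print(sip(G, Q), t_ps({"001": G}, Q, 0.5))
```

Under these assumptions, a three-vertex triangle has 18 possible worlds, the same count that Example 2 reports for graph 001. The enumeration is exponential in the number of vertices and edges, which is consistent with the hardness result stated in the abstract and is why such a sketch is useful only for checking small examples by hand.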
To decide whether q₁ matches uncertain graph 001, we first find all of 001’s PWGs that contain q₁ as a subgraph. Note that “g contains q as a subgraph” means