Probabilistic Pattern Queries over Complex Probabilistic Graphs

Alfredo Cuzzocrea
ICAR-CNR and University of Calabria
Cosenza, Italy
cuzzocrea@si.deis.unical.it

Paolo Serafino
DEIS Dept. – University of Calabria
Cosenza, Italy
pserafino@deis.unical.it

ABSTRACT
This paper introduces probabilistic pattern queries over complex probabilistic graphs, a theoretical graph model we recently proposed for handling the complex probabilistic graph data of modern applications, which are characterized by uncertainty and imprecision. Effective algorithms implementing such queries are also provided.

1. INTRODUCTION
In [1] we introduced, in preliminary form, the so-called Complex Probabilistic Graphs (CPG), which are probabilistic graphs [8] capable of capturing linked data structures that embed both complex-modeling aspects (e.g., [7]) and uncertainty and imprecision aspects (e.g., [2,3,4,5]). As we discuss in [1], current graph-like data models, even those with probability constructs, are not suited to capturing the modeling requirements of modern application scenarios such as linked web data (e.g., [3]), sensor networks (e.g., [2]), distributed stream systems (e.g., [6]), and so forth. One of the main novelties of the CPG graphs introduced in [1] is the idea of associating a Probability Density Function (PDF) [10] with each node, going beyond the simple confidence intervals (plus related probability) [10] used in traditional approaches (e.g., [3]). As in classical formulations, edges in CPG graphs are equipped with existence probabilities, which model the probability with which an edge can be traversed during query evaluation. A major result of [1] is the proposal of two meaningful classes of graph queries that allow us to extract useful knowledge from CPG graphs in terms of algebra-aware (sub-)graphs.
These queries are named Zero-memory Membership Probabilistic Query (ZMPQ) and Non-zero-memory Membership Probabilistic Query (NMPQ), respectively [1], and use the well-understood reachability concept of graph models (e.g., [9]) in a probabilistic manner. As we discuss in [1], both ZMPQ and NMPQ queries are suitable for checking the probabilistic membership of a given query PDF (which may, for instance, model an event or a sequence of events) in the target CPG graph by inspecting the PDFs associated with its nodes, either assuming the availability of a memory (NMPQ queries) or not (ZMPQ queries).

With the aim of extending the research results of [1], in this paper we introduce two further novel classes of queries over CPG graphs. They extend the previous ones by focusing on query patterns rather than query PDFs, and they model such patterns by means of (conventional) graphs. These queries are named Zero-memory Graph Pattern Probabilistic Query (ZGPPQ) and Non-zero-memory Graph Pattern Probabilistic Query (NGPPQ), respectively. ZGPPQ and NGPPQ use ZMPQ and NMPQ queries as baseline routines. We also provide effective algorithms implementing ZGPPQ and NGPPQ queries.

2. BASIC REACHABILITY-BASED MEMBERSHIP PROBABILISTIC QUERIES
In this Section, we provide the formal definitions of the basic queries of our theoretical framework, i.e., ZMPQ and NMPQ queries. Before introducing these query classes, some background concepts are necessary.

In the following, given a CPG and two nodes and of , we refer to the sequence of nodes as a path from node to node in if the following conditions hold: (i) for ; (ii) for each ; (iii) for ; (iv) ; and (v) . Definition 1 introduces the so-called Path Chain Probability (PCP). Furthermore, given a CPG , we denote as the path in from to having the greatest PCP, and as the PCP of .
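The notion of the path with the greatest PCP between two nodes can be illustrated with a short Python sketch. Since the PCP formula is not legible in this text, the sketch assumes that the PCP of a path is the product of the existence probabilities of its edges; this is an assumption made for illustration, not the paper's stated definition. The function name `best_pcp_path` and the adjacency-dictionary representation of the graph are likewise hypothetical, and node PDFs are ignored here.

```python
import heapq


def best_pcp_path(edges, source, target):
    """Return (path, pcp) for the highest-probability path from source to target.

    ASSUMPTION: the PCP of a path is the product of its edge existence
    probabilities. `edges` maps a node to a list of (neighbor, probability)
    pairs, with every probability in [0, 1].
    """
    best = {source: 1.0}      # best known chain probability per node
    parent = {source: None}   # predecessor on the best known path
    heap = [(-1.0, source)]   # max-heap emulated with negated probabilities
    done = set()

    while heap:
        neg_p, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, p in edges.get(u, []):
            cand = -neg_p * p
            # Probabilities never exceed 1, so extending a path cannot
            # increase its product; the Dijkstra-style greedy choice is safe.
            if cand > best.get(v, 0.0):
                best[v] = cand
                parent[v] = u
                heapq.heappush(heap, (-cand, v))

    if target not in best:
        return None, 0.0      # target is unreachable

    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return path, best[target]
```

For instance, with edges a→b (0.9), b→d (0.5), a→c (0.5) and c→d (0.8), the call `best_pcp_path(edges, "a", "d")` selects the path through b, whose product 0.45 exceeds the 0.40 obtained through c.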
DEFINITION 1 – PATH CHAIN PROBABILITY (PCP) – Given a CPG and the path , the chain probability PCP of , denoted as , is defined as follows: .

Definition 2 introduces the concept of Zero-memory Membership Probabilistic Reachability (ZMPR). Given a CPG and a set of nodes in , the ZMPR nodes, denoted by , are defined as the set of nodes of that satisfy the ZMPR property, for a fixed similarity threshold and a fixed existence probability threshold . Intuitively, the ZMPR property is an extended probabilistic membership of a set of nodes in a CPG, characterized by the fact that it is computed (or, equivalently, verified) under the no-memory assumption. At a more practical level, this means that, when checking the ZMPR property at the current node, the check algorithm does not consider the probabilities associated with already-visited nodes, but only the probability associated with the current node. The ZMPR property is the conceptual basis of the first of the innovative classes of graph queries we introduce in our reachability-based framework, the so-called Zero-memory Membership Probabilistic Query (ZMPQ), which is formally defined by Definition 3.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. LWDM’12, March 30, 2012, Berlin, Germany. Copyright 2012 ACM 978-1-4503-1143-4/12/03…$10.00.

DEFINITION 2 – ZERO-MEMORY MEMBERSHIP PROBABILISTIC REACHABILITY (ZMPR) – Given a CPG , a set of nodes , an input PDF , a similarity threshold and an existence probability threshold , the following definitions hold: (i) the set of one-step ZMPR nodes, denoted by