BFilter – A XML Message Filtering and Matching Approach in Publish/Subscribe Systems

Liang Dai
School of Computer Science
Carleton University, Ottawa, Ontario, Canada
liang_dai@yahoo.com

Chung-Horng Lung, Shikharesh Majumdar
Department of Systems and Computer Engineering
Carleton University, Ottawa, Ontario, Canada
{chlung, majumdar}@sce.carleton.ca

Abstract - In publish/subscribe systems, XML message filtering performed at the application layer is an important operation for XML message multicast. As a specific case of content-based multicast at the application layer, XML message multicast depends on the data filtering and matching processes and on the forwarding and routing schemes. As XML data becomes increasingly common in transit, XML message filtering and matching become more and more desirable. BFilter, proposed in this paper, conducts XML message filtering and matching by leveraging branch points in both the XML document and the user query. It evaluates user queries using backward matching branch points, which delays further matching until the branch points in the XML document and the user query match. In this way, XML message filtering can be performed more efficiently because the probability of mismatching is reduced. A number of experiments have been conducted, and the results demonstrate that BFilter performs better than the well-known YFilter for complex queries.

Keywords - XML; XML message filtering and matching; pub/sub systems

1. INTRODUCTION

In publish/subscribe (pub/sub) systems or Web services, application layer multicast is widely used for data dissemination to subscribers. In pub/sub systems, a subscriber registers a subscription with the pub/sub service and receives published messages that match the subscription. Intuitively, the sources (publishers) could send all the data to all subscribers and let each subscriber retain whatever it wants. This approach is inefficient because too many duplicated data packets are sent.
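The difference between sending everything to everyone and delivering only matching messages can be illustrated with a minimal sketch. The `Broker` class, its method names, and the topic-based predicates below are hypothetical and chosen only for illustration; they are not part of BFilter or of any particular pub/sub product.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Message = Dict[str, str]


@dataclass
class Broker:
    """Hypothetical in-memory pub/sub service (illustration only)."""
    subscriptions: Dict[str, Callable[[Message], bool]] = field(default_factory=dict)

    def subscribe(self, subscriber: str, predicate: Callable[[Message], bool]) -> None:
        # A subscription is modelled here as a predicate over the message.
        self.subscriptions[subscriber] = predicate

    def publish_flooded(self, message: Message) -> List[str]:
        # Naive approach: every subscriber receives every message.
        return list(self.subscriptions)

    def publish_filtered(self, message: Message) -> List[str]:
        # Filtering approach: only subscribers whose predicate matches receive it.
        return [name for name, matches in self.subscriptions.items() if matches(message)]


broker = Broker()
broker.subscribe("alice", lambda m: m.get("topic") == "sports")
broker.subscribe("bob", lambda m: m.get("topic") == "finance")
broker.subscribe("carol", lambda m: m.get("topic") == "sports")

msg = {"topic": "sports", "body": "final score"}
flooded = broker.publish_flooded(msg)
filtered = broker.publish_filtered(msg)
print(len(flooded), len(filtered))
```

In this toy setting the flooded delivery sends one copy more than necessary; with many subscribers and selective subscriptions, the gap grows accordingly.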
Generally speaking, there are two ways to carry out multicast in the context of pub/sub systems [2,4,7,8,9,11,13,17,22,23]. The first is to find the subscribers by using the subscription information, and then send the appropriate data to them. Data matching can be performed either at the source or at some centralized brokers. The second is to perform data matching on the fly. In this case, the source simply pushes the data into a network that has a multicast tree composed of routers or brokers. The routers or brokers on the tree have filters that dispatch proper subsets of the data to their children. The children in turn perform data matching and dispatching, and forward the matched data to their own children. This continues until the filtered data reaches the subscribers.

The first approach may use keyword-based multicast [2,11,15,16,18,19,21] or distributed hash table-based multicast. Distributed hash table-based multicast uses hash functions to assign keys to subscribers based on their subscriptions [5]. These methods are efficient in terms of delivery speed. However, the keyword-based approach is less expressive because the subscriptions contain only keywords, and the distributed hash table approach is not content-aware. In these methods, data matching is based on keywords rather than on the content.

The second approach delivers data according to the content. The subscription description is used to perform the matching. A subscription can be expressed either as an n-tuple containing n information spaces, or as XPath expressions [1,6,13,14]. An XPath expression is used for addressing portions of an XML file, and XPath is more expressive than an n-tuple. An XML file is a tree-based structure for describing information. Because an XML file is structured, filters can be applied naturally along its hierarchy to perform data matching and delivery. XML-based multicast can properly match and deliver messages to subscribers.
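To make XPath-based subscriptions concrete, the following minimal sketch matches a few path expressions against a small XML message using Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath. The document, the subscription identifiers, and the function `matching_subscribers` are assumptions made for illustration. Each subscription is evaluated independently here, so this is a naive baseline and not the filtering algorithm of BFilter or YFilter.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML message published into the system.
DOCUMENT = """
<catalog>
  <book>
    <title>XML Filtering</title>
    <author><name>Lee</name></author>
    <price>30</price>
  </book>
  <magazine>
    <title>Weekly</title>
  </magazine>
</catalog>
"""

# Hypothetical subscriptions, written in the XPath subset ElementTree accepts.
SUBSCRIPTIONS = {
    "s1": "./book/title",              # simple linear path
    "s2": ".//name",                   # element at any depth below the root
    "s3": "./book[author]/price",      # nested path: book must have an author child
    "s4": "./magazine[author]/title",  # nested path with no match in this message
}


def matching_subscribers(xml_text, subscriptions):
    """Return the ids of subscriptions that select at least one element."""
    root = ET.fromstring(xml_text)
    return [sid for sid, path in subscriptions.items() if root.find(path) is not None]


matched = matching_subscribers(DOCUMENT, SUBSCRIPTIONS)
print(matched)
```

Subscriptions `s3` and `s4` show why path expressions are more expressive than keywords: both mention `author`, but only the one whose structural condition holds in the message is matched.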
However, because it is difficult to index and identify the elements in an XML file, the filtering process in each node is time consuming. Hence, the performance of XML-based multicast depends heavily on the approach used to process the XML message.

Several approaches for XML filtering have been reported in the literature; see Section 2 for details. One common limitation of those approaches is that complex queries with nested paths have to be decomposed into sub-queries, and a post-processing task is then needed. As a result, the filtering process becomes inefficient.

This paper proposes a novel XML message filtering algorithm, BFilter (B stands for branch points). BFilter recognizes the tree structure in both XML documents and user queries with nested paths. It conducts the XML message filtering and matching process by identifying branch points in both XML documents and user queries. The evaluation of user queries uses backward matching branch points to delay further matching, so that the probability of a mismatch is reduced and XML message filtering can be performed more efficiently.

The rest of the paper is organized as follows. Section 2 presents the background. Section 3 discusses the backward matching branch point algorithm. Section 4 presents some experimental results. Finally, Section 5 summarizes the paper.

978-1-4244-5637-6/10/$26.00 ©2010 IEEE This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE Globecom 2010 proceedings.

2. BACKGROUND AND RELATED WORK

There are two important operations performed in a pub/sub system: XML message filtering and multicast. This paper focuses on techniques for XML message filtering.

XFilter [1] is based on deterministic finite automata; it stores user queries and handles each query individually. It is capable of handling XPath relationship notations, such as ancestor/descendant (represented by ‘//’ in XPath) as well as