Caching Techniques for XML Message Filtering

Yang Cao, Shikharesh Majumdar and Chung-Horng Lung
School of Computer Science
Department of Systems and Computer Engineering
Carleton University, Ottawa, Canada
Email: ycao5@scs.carleton.ca, {majumdar, chlung}@sce.carleton.ca

Abstract—An XML publish/subscribe system filters streams of XML messages against a large number of subscriptions expressed in XPath. A major issue in an XML-based publish/subscribe system is its performance: as the number of XML documents and XPath-based subscriptions in the system increases, providing efficient XML filtering becomes a challenging problem, and optimization techniques are needed to meet this challenge. Many approaches to designing efficient XML filtering engines exist. Most existing research efforts focus on filtering algorithms that achieve high system performance or support more complex XPath syntax, and each proposed scheme has its advantages and limitations. Little research, however, has considered caching in the context of XML filtering. In this paper, we propose two caching schemes to be used in conjunction with an XML filtering engine. First, we present a complete message caching algorithm, a strict caching policy that reduces the computation cost incurred by filtering the same messages multiple times by reusing the results of previously processed messages. Second, we investigate a structure-based caching method, an approximate caching policy for messages that share the same structure. Performance evaluations on both synthetic and real data show that the complete message caching and structure-based caching schemes achieve significantly better filtering performance (up to 80% for both caching schemes for the message streams experimented with).

Keywords-publish/subscribe; XML; caching; performance evaluation.

I.
INTRODUCTION

Content-based publish/subscribe (pub/sub) is a paradigm for building distributed applications using an asynchronous communication mechanism. Since XML is widely used for interoperability, systems using the XML pub/sub scheme have started to receive more attention. The structure of an XML document can be described either with a Document Type Definition (DTD) or with an XML schema. XML has a fixed syntax and an unlimited vocabulary. It allows two distributed application components to communicate by exchanging XML messages across different networks. The tag names and attribute names are referred to as structural information, and the text and attribute values as value information.

XPath is a query language used to locate data in an XML document [3]. An XPath expression consists of a sequence of location steps. Each step contains an axis, a node test and, optionally, predicates. An axis describes relationships between elements in a message. The child axis (‘/’) means that an element is a child of the current element. The descendant operator (‘//’) means that an element is a descendant of the current element. The wildcard operator (‘*’) matches any element name. Predicates (‘[ ]’) are used to select on the text data, an attribute value or the position of an element.

Figure 1. Publish/subscribe systems

The structural nature of XML and the flexibility of XPath give rise to complexity in the XML filtering problem. Besides content matching, a filtering engine must perform structure matching to select the message nodes that satisfy a subscription. In the context of our discussion, messages are encoded in XML and user subscriptions are specified as XPath expressions, denoted as XP^{/,//,*,[]} [3]. We use query and subscription interchangeably in this paper. XSLT [4] is a language for transforming the structure and content of an XML document. It is an ideal tool for converting one set of XML documents into another set of XML documents.
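To make the XPath operators described above concrete, the following minimal sketch evaluates a few subscriptions against one XML message using Python's standard `xml.etree.ElementTree` module, which implements only a subset of XPath. The message, the element and attribute names, the subscription ids and the helper `matching_subscriptions` are all hypothetical and chosen purely for illustration; this is not the filtering engine studied in this paper, and it evaluates each subscription independently rather than sharing work across subscriptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML message; all element and attribute names are invented.
MESSAGE = """
<catalog>
  <book id="b1">
    <title>XML Filtering</title>
    <author><name>Lee</name></author>
  </book>
  <book id="b2">
    <title>Caching</title>
    <author><name>Kim</name></author>
  </book>
</catalog>
"""

# Hypothetical subscriptions, written relative to the root element <catalog>.
SUBSCRIPTIONS = {
    "s1": "./book/title",            # child axis '/'
    "s2": ".//name",                 # descendant operator '//'
    "s3": "./book/*",                # wildcard '*'
    "s4": "./book[@id='b2']/title",  # predicate on an attribute value
    "s5": "./book[1]/title",         # predicate on element position
    "s6": "./magazine//name",        # selects nothing in this message
}


def matching_subscriptions(message_xml, subscriptions):
    """Return the ids of subscriptions that select at least one node."""
    root = ET.fromstring(message_xml.strip())
    return [sid for sid, path in subscriptions.items() if root.findall(path)]


if __name__ == "__main__":
    root = ET.fromstring(MESSAGE.strip())
    for sid, path in SUBSCRIPTIONS.items():
        selected = [el.tag for el in root.findall(path)]
        print(sid, path, selected)
    print("matched:", matching_subscriptions(MESSAGE, SUBSCRIPTIONS))
```

In this sketch a subscription is treated as matched when its expression selects at least one node of the message, which mirrors the structure and content matching described above at a very small scale.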
An XML pub/sub system matches XML messages against a large number of user subscriptions and delivers the messages to the identified subscribers across different networks using a routing scheme at the application layer. A general architecture of an XML-based pub/sub system is depicted in Fig. 1; it includes publishers, subscribers and brokers. A broker performs the function of a router in a publish/subscribe system. Each publisher edge node maintains a spanning tree by broadcasting announcement messages, and each broker is assumed to know its neighbors and the best path leading to a publisher edge node [11]. For example, brokers in Fig. 1 are

978-1-4244-5736-6/09/$26.00 ©2009 IEEE