Caching Techniques for XML Message Filtering
Yang Cao∗, Shikharesh Majumdar† and Chung-Horng Lung†
∗School of Computer Science
†Department of Systems and Computer Engineering
Carleton University, Ottawa, Canada
Email: ycao5@scs.carleton.ca, {majumdar, chlung}@sce.carleton.ca
Abstract—An XML publish/subscribe system filters XML
message streams against a large number of subscriptions
expressed in XPath. A major issue in such a system is its
performance. As the number of XML documents and XPath-based
subscriptions in the system increases, filtering XML messages
efficiently becomes a challenging problem, and optimization
techniques are needed to meet this challenge. Many approaches
to designing efficient XML filtering engines have been
proposed. Most existing research efforts focus on efficient
filtering algorithms that achieve high system performance
or support more complex XPath syntax, and each proposed
scheme has its advantages and limitations. Little research,
however, has considered caching in the context of XML
filtering. In this paper, we propose two caching schemes to be
used in conjunction with an XML filtering engine. First, we
present a complete message caching algorithm, a strict
caching policy that reduces the computation cost of filtering
the same message multiple times by reusing the results
of previously processed messages. Second, we investigate a
structure-based caching method, an approximate caching
policy for messages that share the same structure. Performance
evaluations on both synthetic and real data show that the
complete message caching and structure-based caching schemes
achieve significantly better filtering performance
(up to 80% for both caching schemes on the message streams
used in the experiments).
Keywords-publish/subscribe; XML; caching; performance
evaluation.
I. INTRODUCTION
A content-based publish/subscribe (pub/sub) system is
a paradigm for building distributed applications using an
asynchronous communication mechanism. Since XML is
widely used for interoperability, systems using the XML
pub/sub scheme have started to receive more attention.
XML can be described either with a Document Type
Definition (DTD) or with an XML schema. XML has a
fixed syntax and an unlimited vocabulary. It allows two
distributed application components to communicate by exchanging
XML messages across different networks. The
tag names and attribute names are referred to as structural
information, and the text and attribute values as value
information. XPath is a query language used to locate data
in an XML document [3]. An XPath expression consists of
a sequence of location steps. Each step contains an axis,
a node test, and zero or more predicates. An axis describes
relationships between elements in a message. The child axis
(‘/’) means that an element is a child of the current element.
The descendant operator (‘//’) means that an element is a
descendant of the current element. The wildcard
(‘*’) operator matches any element name. Predicates (‘[ ]’)
are used to select on text data, attribute values, or the
position of an element. The structural nature of XML and the
flexibility of XPath give rise to complexity in the XML
filtering problem. Besides content matching, a filtering engine
must perform structure matching to select the message nodes
that satisfy a subscription. In the
context of our discussion, messages are encoded in XML
and user subscriptions are specified as XPath expressions
in the fragment denoted XP^{/,//,∗,[]} [3]. We use query and
subscription interchangeably in this paper. XSLT [4] is a
language for transforming the structure and content of an XML
document. It is well suited to converting one set of XML
documents into another set of XML documents.
Figure 1. Publish/subscribe systems
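As a minimal sketch of the XPath features described above (child axis, descendant operator, wildcard, and predicates), the following Python snippet matches a handful of subscriptions against a single XML message. It uses the standard-library xml.etree.ElementTree module, which supports only a subset of XPath. The sample message, the subscription identifiers, and the function name matching_subscriptions are illustrative assumptions and do not come from this paper. Paths are written relative to the root element because ElementTree does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML message, invented for illustration only.
MESSAGE = """<catalog>
  <book id="b1">
    <title>XML Filtering</title>
    <price>30</price>
  </book>
  <journal id="j1">
    <article>
      <title>Caching</title>
    </article>
  </journal>
</catalog>"""

# Hypothetical subscriptions, written relative to the root element <catalog>.
SUBSCRIPTIONS = {
    "s1": "./book/title",        # child axis ('/')
    "s2": ".//title",            # descendant operator ('//')
    "s3": "./*/title",           # wildcard ('*')
    "s4": "./book[@id='b1']",    # predicate on an attribute value
    "s5": "./book[price='99']",  # predicate on text data (no match here)
}


def matching_subscriptions(message, subscriptions):
    """Return the sorted ids of subscriptions that select at least one node."""
    root = ET.fromstring(message)
    return sorted(
        sid for sid, path in subscriptions.items()
        if root.find(path) is not None
    )


if __name__ == "__main__":
    print(matching_subscriptions(MESSAGE, SUBSCRIPTIONS))
```

This sketch evaluates every subscription independently against the parsed message, which is the naive baseline; it is not an efficient filtering engine and only illustrates what it means for a message to match a subscription structurally and by value.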
An XML pub/sub system matches XML messages against
a large number of user subscriptions and delivers messages
to those identified subscribers across different networks
using a routing scheme at the application layer. A general
architecture of an XML-based pub/sub system is depicted in Fig. 1,
which includes publishers, subscribers and brokers. A broker
performs the function of a router in a publish/subscribe
system. Each publisher edge node maintains a spanning tree
by broadcasting announcement messages and each broker is
assumed to know its neighbors and the best path leading to a
publisher edge node [11]. For example, brokers in Fig. 1 are
315 978-1-4244-5736-6/09/$26.00 ©2009 IEEE