Towards Efficient Dissemination and Filtering of XML Data Streams

Kirill Belyaev
Computer Science Department, Colorado State University
Fort Collins, CO 80523-1873
Email: kirill@cs.colostate.edu

Indrakshi Ray
Computer Science Department, Colorado State University
Fort Collins, CO 80523-1873
Email: iray@cs.colostate.edu

Abstract: The vast amounts of data generated in near real time by the prolific use of sensors, the pervasive use of the mobile Internet, and the popularity of social media platforms necessitate the efficient dissemination of semi-structured streaming data to the consuming applications. Towards this end, we introduce a subscriber-centric XML filtering approach that provides a seamless and efficient XML stream replication/distribution mechanism. The subscriber-centric filtering architecture can be configured in different topologies to support efficient message filtering for a large number of concurrent subscribers. It allows selective filtering on the various nodes, which improves efficiency and provides applications with data on a need-to-know basis. Moreover, it supports interoperability and allows semi-structured streams generated from multiple sources to be filtered. Our XML filtering network consists of decoupled data producers, message transformation agents and XML brokers that can be deployed in conventional data centers as well as in the public cloud environment. We provide detailed performance results for processing filtering queries in several use case scenarios with varying XML message loads and numbers of nodes involved in the replication/dissemination process. Our results indicate that the subscriber-centric XML filtering architecture is a viable approach for disseminating semi-structured data streams to the various consuming applications.

I. INTRODUCTION

With the increase in the usage of mobile devices that are connected to the Internet, consumers are subscribing to various types of applications, such as Yahoo! Weather, Yahoo!
Finance, and Twitter, that require delivery of streaming data in a timely manner. There is a need for gathering semi-structured streaming data from the sources, transforming it into a form that facilitates interoperability, and then replicating/distributing the data stream efficiently to the many applications that need the data for various purposes, such as forwarding the data to the subscribing consumers and/or performing complex stream analytics to detect trends or outliers.

Publish/subscribe (pub/sub) has been a popular communication paradigm that provides customized notifications to users in a distributed environment [1]. Pub/sub systems are used in geomarketing, traffic and weather alerts, emergency response services, and social networking. These systems are large (e.g., Twitter is estimated to handle over 400 million tweets daily), geographically distributed and largely subscription based. Subscriptions in such systems, such as deals for local shops and traffic alerts for freeways, involve simple queries and are short-lived. Such pub/sub systems provide very little query support and trade expressiveness for performance. However, their inability to support expressive continuous queries over data streams, possibly in different formats, makes them unsuitable for detecting complex events that arise in situation monitoring applications.

The majority of modern Internet applications use XML as an inter-application communication exchange format in spite of its heavy network bandwidth utilization. Typically, the applications generate data in XML so that it can be easily distributed to other applications by operational runtime environments [2] [3] [4]. XML-based data dissemination networks are starting to become a reality [4]. Data generated in XML format should be adapted for efficient streaming, filtering and consumption by the subscribing applications.
We address this issue by introducing the subscriber-centric XML content filtering service, in which each XML message generated or received by the application layer is transformed into a dissemination-ready XML message for transport over the network infrastructure.

In this paper, we propose the TeleScope XML filtering broker [5] to carry out the selective dissemination/replication of XML messages to consuming end-points. Our subscriber-centric broker has the following characteristics. (i) Fast processing of XML messages under high input stream rates and large numbers of subscribers: the TeleScope XML filtering broker is written in C, which supports very fast message filtering speeds even with a large number of concurrent subscribers. (ii) Content-based XML filtering with an expressive filtering language: TeleScope introduces an engine with a simple yet efficient, user-friendly parser for a domain-specific content filtering language over the XML stream, with full support for Boolean logic operators as well as supplemental operators such as network prefix range computing operators. The language allows easy integration with XML-consuming applications and does not require knowledge of complex XPath/XQuery [6] semantics, but supports the common stream filtering/dissemination scenarios. (iii) Ability to form an overlay filtering network for XML dissemination: placing TeleScope nodes in the form of a filtering mesh allows efficient dissemination of XML content to the endpoints.

The rest of the paper is organized as follows. Section II gives a detailed overview of XML stream replication for consuming applications and describes our subscriber-centric filtering architecture. Section III highlights the XML filtering framework. Section IV describes the subscriber management involved in the task of efficient stream dissemination.
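As a rough illustration of characteristic (ii) above, the following sketch shows how Boolean operators and a network prefix range operator might be combined into a content filter over a single XML message. It is not TeleScope's query language or implementation: the broker is written in C, Python is used here only for brevity, and the helper names (`eq`, `in_prefix`, `all_of`, `any_of`, `negate`, `matches`) and the message layout (`update`, `type`, `prefix`, `origin_as`) are hypothetical names chosen for this example.

```python
import ipaddress
import xml.etree.ElementTree as ET

# All names below are hypothetical; they illustrate the idea of Boolean
# content filtering with a prefix-range operator, not TeleScope's syntax.

def eq(tag, value):
    """Predicate: some element named `tag` has text exactly equal to `value`."""
    return lambda root: any(
        (el.text or "").strip() == value for el in root.iter(tag)
    )

def in_prefix(tag, network):
    """Predicate: some element named `tag` holds a prefix inside `network`."""
    net = ipaddress.ip_network(network)

    def check(root):
        for el in root.iter(tag):
            try:
                candidate = ipaddress.ip_network(
                    (el.text or "").strip(), strict=False
                )
            except ValueError:
                continue  # element text is not an IP prefix; skip it
            # subnet_of() requires both networks to share an IP version
            if candidate.version == net.version and candidate.subnet_of(net):
                return True
        return False

    return check

def all_of(*preds):
    """Boolean AND over predicates."""
    return lambda root: all(p(root) for p in preds)

def any_of(*preds):
    """Boolean OR over predicates."""
    return lambda root: any(p(root) for p in preds)

def negate(pred):
    """Boolean NOT of a predicate."""
    return lambda root: not pred(root)

def matches(xml_text, query):
    """Parse one XML message and evaluate the filter against it."""
    return query(ET.fromstring(xml_text))

# Made-up message layout, used only to exercise the predicates.
SAMPLE_MESSAGE = """
<update>
  <type>ANNOUNCE</type>
  <prefix>10.1.2.0/24</prefix>
  <origin_as>64512</origin_as>
</update>
"""

# type == ANNOUNCE AND (prefix within 10.0.0.0/8 OR origin_as == 64500)
QUERY = all_of(
    eq("type", "ANNOUNCE"),
    any_of(in_prefix("prefix", "10.0.0.0/8"), eq("origin_as", "64500")),
)

if __name__ == "__main__":
    print(matches(SAMPLE_MESSAGE, QUERY))
```

In a subscriber-centric setting, a broker would hold one such filter per subscriber and forward a message only to those subscribers whose filter evaluates to true. Composing predicates as plain functions keeps the example short; a real engine would parse the filter from a query string.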
2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing. 978-1-5090-0154-5/15 $31.00 © 2015 IEEE. DOI 10.1109/CIT/IUCC/DASC/PICOM.2015.278

Section