2016 IEEE International Conference on Big Data (Big Data)
978-1-4673-9005-7/16/$31.00 ©2016 IEEE

Real Time Processing of Streaming and Static Information

Christoforos Svingos 1, Theofilos Mailis 1, Herald Kllapi 1,2, Lefteris Stamatogiannakis 1, Yannis Kotidis 3, Yannis Ioannidis 1
{csvingos, tmailis, herald, estama, yannis}@di.uoa.gr, kotidis@aueb.gr
1 Dept. of Informatics and Telecommunications, University of Athens, Greece. 2 currently at Google. 3 Dept. of Informatics, Athens University of Economics and Business, Greece.

Abstract: Big Data applications require real-time processing of complex computations on streaming and static information. Applications such as the diagnosis of power generating turbines require the integration of high velocity streaming data and large volumes of static data from multiple sources. In this paper we study various optimisations for the efficient processing of streaming and static information. We introduce novel indexing structures for stream processing and a query-planner component that decides when their creation is beneficial, and we examine precomputed summarisations on archived measurements to accelerate streaming and static information processing. To put our ideas into practice, we have developed EXASTREAM, a data stream management system that is scalable, has declarative semantics, supports user defined functions, and allows efficient execution of complex analytical queries on streaming and static data. Our work is accompanied by an empirical evaluation of our optimisation techniques.

Keywords: Stream Processing, SQL, Static Data, Performance

I. INTRODUCTION

Emerging Big Data applications require real-time processing of complex computations on streaming and static information. This is a challenging task, since it involves integrating high velocity streaming data and large volumes of static data from multiple sources, across many concurrent continuous queries that need to be executed.
A typical scenario described in [1] requires monitoring and diagnosing power-generating turbines. In the described scenario, several service centres are dedicated to diagnostics, utilizing data from more than 100,000 thermocouple sensors installed in 950 power generating turbines located across the globe. One typical task of such a centre is to detect in real time potential faults of a turbine caused by, e.g., an undesirable pattern in the temperature's behaviour within various components of the turbine. This task requires extracting, aggregating, and correlating (i) streaming data produced by up to 2,000 sensors installed in different parts of the turbine, (ii) static data about the turbine's structure, and (iii) historical operational data of each sensor stored in multiple data sources.

* This research has been partially supported by the EU project Optique (FP7-IP-318338).

This need has triggered the design of scalable approaches that provide low-latency answers to queries on high-velocity live streams and high-volume static data sources. In this paper we study several novel optimisation techniques for efficiently processing analytical queries on streaming and static information. In particular: (i) we introduce novel in-memory indexing structures and algorithms dedicated to accelerating stream processing; (ii) we propose the adaptive stream indexing technique, which is responsible for creating on the fly the appropriate indexing structures that will accelerate the execution of live-stream operations.

To put our ideas into practice, we have developed the EXASTREAM Data Stream Management System (DSMS), an experimental DSMS that fuses streaming operators into the SQLite database engine.
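To make the kind of stream and static integration described in the turbine scenario more concrete, the following is a minimal sketch using Python's standard `sqlite3` module. It is not the EXASTREAM query language or API: the table names (`sensor`, `reading`), the column names, the helper `average_per_component`, and all values are hypothetical and chosen only for illustration. The sketch joins one window of streamed temperature readings with a static table that maps sensors to turbine components, and aggregates per component.

```python
import sqlite3

# Hypothetical schema and values, for illustration only; this is plain
# sqlite3, not the EXASTREAM query language or API.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sensor (sensor_id INTEGER PRIMARY KEY,
                         turbine_id INTEGER, component TEXT);
    CREATE TABLE reading (sensor_id INTEGER, ts INTEGER, temperature REAL);
""")

# Static data: which sensor sits in which turbine component.
conn.executemany("INSERT INTO sensor VALUES (?, ?, ?)", [
    (1, 10, "burner"),
    (2, 10, "burner"),
    (3, 11, "compressor"),
])


def average_per_component(window):
    """Join one window of streamed readings with the static sensor table."""
    # The reading table holds only the current window.
    conn.execute("DELETE FROM reading")
    conn.executemany("INSERT INTO reading VALUES (?, ?, ?)", window)
    return conn.execute("""
        SELECT s.turbine_id, s.component, AVG(r.temperature)
        FROM reading AS r JOIN sensor AS s USING (sensor_id)
        GROUP BY s.turbine_id, s.component
        ORDER BY s.turbine_id, s.component
    """).fetchall()


# One window of (sensor_id, timestamp, temperature) tuples.
window = [(1, 100, 500.0), (2, 100, 520.0), (3, 100, 300.0)]
result = average_per_component(window)
print(result)
```

Re-populating a table for every window keeps the sketch short; it says nothing about how a DSMS would manage windows internally, which is where indexing structures such as those studied in this paper become relevant.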
EXASTREAM has several significant features: (i) scalability: our system can run in a distributed environment, and queries can easily be added and removed without disrupting existing query execution; (ii) declarative semantics: our system provides a declarative language that extends the SQL syntax and semantics for querying live streams and relations; (iii) user defined functions: our system natively supports user defined functions with arbitrary user code; (iv) stream and static data integration: owing to its architecture and implementation, our system natively supports streaming and static data integration. It should be noted that the optimisations we propose are general and can be adopted by other stream processing systems as well.

In our experimental evaluation we study the effect of the proposed optimisations in a cloud deployment of EXASTREAM on up to 128 nodes, using real sensor data from power generating turbines. Our findings demonstrate the effectiveness of our techniques in processing up to one thousand live stream queries and in performing correlation analysis between live and archived stream measurements in real time.

II. SYSTEM OVERVIEW

The EXASTREAM Data Stream Management System (DSMS) has been designed to efficiently process both static and streaming information. It is embedded in EXAREME (https://www.exareme.org), a system for elastic large-scale dataflow processing on the cloud [2], [3] that is publicly available as an open-source project under the MIT License. EXASTREAM was implemented as a key