A Multi-level Elasticity Framework for Distributed Data Stream Processing

Matteo Nardelli, Gabriele Russo Russo, Valeria Cardellini, and Francesco Lo Presti

Department of Civil Engineering and Computer Science Engineering
University of Rome Tor Vergata, Italy
{nardelli,russo.russo,cardellini}@ing.uniroma2.it, lopresti@info.uniroma2.it

Abstract. Data Stream Processing (DSP) applications should be capable of efficiently processing high-velocity continuous data streams by elastically scaling the parallelism degree of their operators, so as to deal with high variability in the workload. Moreover, to use computing resources efficiently, modern DSP frameworks should seamlessly support infrastructure elasticity, which allows exploiting resources available on demand in geo-distributed Cloud and Fog systems. In this paper we propose E2DF, a framework to autonomously control the multi-level elasticity of DSP applications and the underlying computing infrastructure. E2DF revolves around a hierarchical approach, with two control layers that work at different granularities and time scales. At the lower level, fully decentralized Operator and Region managers control the reconfiguration of distributed DSP operators and resources. At the higher level, centralized managers oversee the overall application and infrastructure adaptation. We have integrated the proposed solution into Apache Storm, relying on a previous extension we developed, and conducted an experimental evaluation. It shows that, even with simple control policies, E2DF can improve resource utilization without degrading application performance.

Keywords: Data Stream Processing, Elasticity, Hierarchical Control

1 Introduction

Exploiting on-the-fly computation, Data Stream Processing (DSP) applications can process unbounded data flows to extract high-value information as soon as new data become available.
A DSP application is represented as a directed (acyclic) graph, with data sources, operators, and final consumers as vertices, and streams as edges. Importantly, these applications are usually long-running and often subject to strict latency requirements that must be met in the face of variable and high data volumes. To deal with operator overloading, a commonly adopted stream processing optimization is data parallelism, which consists of scaling out or scaling in the number of parallel instances of an operator, so that each instance can process a subset of the incoming data flow in parallel [7].
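To make the idea of operator data parallelism concrete, the following is a minimal sketch (not the paper's E2DF implementation): a logical operator is replicated into several parallel instances, incoming tuples are key-partitioned across them, and a scale operation changes the number of instances. The names `ParallelOperator`, `route`, and `scale` are illustrative assumptions, and a real system would also migrate per-key operator state on reconfiguration.

```python
from collections import defaultdict


class ParallelOperator:
    """A logical DSP operator replicated into `parallelism` parallel instances."""

    def __init__(self, parallelism):
        self.parallelism = parallelism
        self.processed = defaultdict(list)  # instance id -> tuples it handled

    def route(self, key):
        # Key-based (hash) partitioning: tuples with the same key always
        # reach the same instance, preserving per-key processing order.
        return hash(key) % self.parallelism

    def process(self, key, value):
        instance = self.route(key)
        self.processed[instance].append((key, value))
        return instance

    def scale(self, new_parallelism):
        # Scale-out / scale-in reconfiguration: change the instance count.
        # A production runtime would also redistribute per-key state here.
        self.parallelism = new_parallelism


op = ParallelOperator(parallelism=2)
for i in range(100):
    op.process(key=f"user-{i % 10}", value=i)

# With parallelism 2, the 100 tuples are spread over at most 2 instances.
busy_before = len(op.processed)

op.scale(4)  # scale out to absorb a workload spike
```

The key-based routing shown here is one common choice (akin to fields grouping in Apache Storm); round-robin shuffling is an alternative when the operator keeps no per-key state.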