Proactive Auto-scaling of Resources for Stream Processing Engines in the Cloud

Tarek M. Ahmed, Farhana H. Zulkernine, James R. Cordy
School of Computing, Queen's University
Kingston, ON, Canada
{tahmed,farhana,cordy}@cs.queensu.ca

ABSTRACT
Large scale applications nowadays continuously generate massive amounts of data at high speed. Stream processing engines (SPEs) such as Apache Storm and Flink are becoming increasingly popular because they provide reliable platforms to process such fast data streams in real time.

Despite previous research in the field of auto-scaling of resources, current SPEs, whether open source such as Apache Storm or commercial such as the streaming components in IBM InfoSphere and Microsoft Azure, lack the ability to automatically grow and shrink to meet the needs of streaming data applications. Moreover, previous research on auto-scaling focuses on techniques for scaling resources reactively, which can delay the scaling decision unacceptably for time-sensitive stream applications. To the best of our knowledge, there has been little or no research using machine learning techniques to proactively predict future bottlenecks based on the data flow characteristics of the data stream workload.

In this position paper, we present our vision of a three-stage framework to auto-scale resources for SPEs in the cloud. In the first stage, a workload model is created using data flow characteristics. The second stage uses the output of the workload model to predict future bottlenecks. Finally, the third stage makes the scaling decision for the resources. We begin with a literature review on the auto-scaling of popular SPEs such as Apache Storm.

Keywords
Streaming data, auto-scaling, elasticity, machine learning

1. INTRODUCTION
Stream Processing Engines (SPEs) are frameworks that can reliably process and query stream data at high volume and high speed.
SPEs, such as Apache Storm [3] and Flink [1], are becoming increasingly popular with the emergence of new data sources that can produce massive amounts of data in short periods of time. Examples of such sources include social networks such as Facebook and Twitter, and networks of smart devices in the Internet of Things (IoT) [11]. Cisco [5] expects that about 50 billion devices will be connected to the internet by 2020, generating massive amounts of fast streaming data.

[Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2016 ACM. ISBN X-XXXXX-XX-X/XX/XX. DOI: http://dx.doi.org/10.1145/0000000.0000000]

Modern organizations require real-time analysis of their high speed streaming data, and use SPEs to provide timely feedback and decision-making. Because of its cost-effective pay-as-you-go model, organizations increasingly choose to host their systems, including SPEs, in the cloud.

To benefit from the pay-as-you-go model, a cloud service should be able to optimize the usage of resources and minimize latency. Major cloud vendors have well-established auto-scaling techniques to handle fixed, predictable workloads such as database queries. Streaming data, on the other hand, is dynamic, unbounded and unpredictable, and traditional auto-scaling techniques are not adequate.
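As a concrete illustration of the kind of data flow characteristics such an approach could exploit, the sketch below estimates a stream's speed (arrival rate per interval) and acceleration (change in that rate) over a sliding window, and naively extrapolates the next interval's expected volume. This is a minimal Python sketch under our own assumptions: the name `FlowMonitor`, the window size, and the linear extrapolation are illustrative choices, not part of any SPE's API.

```python
from collections import deque


class FlowMonitor:
    """Track per-interval tuple counts of a data stream and estimate
    its speed, acceleration, and expected next-interval volume.
    (Hypothetical sketch, not part of any SPE API.)"""

    def __init__(self, window=3):
        # tuple counts for the most recent `window` intervals
        self.counts = deque(maxlen=window)

    def observe(self, count):
        """Record the number of tuples that arrived in the last interval."""
        self.counts.append(count)

    def speed(self):
        """Average arrival rate (tuples per interval) over the window."""
        return sum(self.counts) / len(self.counts)

    def acceleration(self):
        """Change in arrival rate between the two most recent intervals."""
        if len(self.counts) < 2:
            return 0
        return self.counts[-1] - self.counts[-2]

    def expected_next(self):
        """Naive linear extrapolation of the next interval's volume."""
        return self.counts[-1] + self.acceleration()


# Example: a stream accelerating from 100 to 220 tuples per interval.
monitor = FlowMonitor(window=3)
for count in (100, 150, 220):
    monitor.observe(count)

print(monitor.speed())          # average rate over the window
print(monitor.acceleration())   # rate grew by 70 tuples/interval
print(monitor.expected_next())  # 290 tuples expected next interval
```

A proactive scaler could compare `expected_next()` against the currently provisioned capacity and request additional resources before the spike arrives, rather than reacting after queues have already built up.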
New auto-scaling techniques are, therefore, required that analyze the data flow characteristics and use the knowledge hidden within the data to make more reliable and adaptive scaling decisions.

In this position paper we present our vision of a novel framework to auto-scale cloud resources for streaming data in popular SPEs such as Apache Storm. Our vision explores two fundamental ideas. First, we examine streaming data flow characteristics such as speed and acceleration. For example, the speed of a stock market stream is directly affected by the occurrence of a major event such as a natural disaster. By analyzing the data stream and measuring its speed and acceleration, we can anticipate the upcoming stream volume, and further predict the required cloud resources.

Second, we explore the use of streaming machine learning techniques to classify workloads and predict future bottlenecks. Our motivation is the fact that the data flow characteristics of the streaming data can themselves be thought of as another streaming data source that can be analyzed using an SPE.

While auto-scaling of resources is not a new topic in cloud research, little of this work has targeted popular and practical SPEs such as Apache Storm. In this work we specifically aim to address this gap, using the data flow characteristics of streaming data and machine learning techniques to proactively predict the scaling of resources.

2. RELATED WORK
Auto-scaling of cloud resources is not a new topic in the literature. Efforts have been made to achieve auto-scaling