Proactive Auto-scaling of Resources for Stream Processing Engines in the Cloud

Tarek M. Ahmed, Farhana H. Zulkernine, James R. Cordy
School of Computing, Queen's University
Kingston, ON, Canada
{tahmed,farhana,cordy}@cs.queensu.ca

ABSTRACT
Large scale applications nowadays continuously generate massive amounts of data at high speed. Stream processing engines (SPEs) such as Apache Storm and Flink are becoming increasingly popular because they provide reliable platforms to process such fast data streams in real time.

Despite previous research in the field of auto-scaling of resources, current SPEs, whether open source such as Apache Storm or commercial such as the streaming components in IBM InfoSphere and Microsoft Azure, lack the ability to automatically grow and shrink to meet the needs of streaming data applications. Moreover, previous research on auto-scaling focuses on techniques for scaling resources reactively, which can delay the scaling decision unacceptably for time-sensitive stream applications. To the best of our knowledge, there has been little or no research using machine learning techniques to proactively predict future bottlenecks based on the data flow characteristics of the data stream workload.

In this position paper, we present our vision of a three-stage framework to auto-scale resources for SPEs in the cloud. In the first stage, a workload model is created using data flow characteristics. The second stage uses the output of the workload model to predict future bottlenecks. Finally, the third stage makes the scaling decision for the resources. We begin with a literature review on the auto-scaling of popular SPEs such as Apache Storm.

Keywords
Streaming data, auto-scaling, elasticity, machine learning

1. INTRODUCTION
Stream Processing Engines (SPEs) are frameworks that can reliably process and query stream data at high volume and high speed.
SPEs, such as Apache Storm [3] and Flink [1], are becoming increasingly popular with the emergence of new data sources that can produce massive amounts of data in short periods of time. Examples of such sources include social networks such as Facebook and Twitter, and networks of smart devices in the Internet of Things (IoT) [11]. Cisco [5] expects that about 50 billion devices will be connected to the internet by 2020, generating massive amounts of fast streaming data.

[Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2016 ACM. ISBN X-XXXXX-XX-X/XX/XX. DOI: http://dx.doi.org/10.1145/0000000.0000000]

Modern organizations require real-time analysis of their high speed streaming data, and use SPEs to provide timely feedback and decision-making. Because of its cost-effective pay-as-you-go model, organizations increasingly choose to host their systems, including SPEs, in the cloud.

To benefit from the pay-as-you-go model, a cloud service should be able to optimize the usage of resources and minimize latency. Major cloud vendors have well-established auto-scaling techniques to handle fixed, predictable workloads such as database queries. Streaming data, on the other hand, is dynamic, unbounded and unpredictable, and traditional auto-scaling techniques are not adequate.
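As a concrete illustration of the kind of data flow characteristics such an approach could exploit, the sketch below estimates a stream's speed (arrival rate per interval) and acceleration (change in that rate) over a sliding window, and naively extrapolates the next interval's expected volume. This is a minimal Python sketch under our own assumptions: the name `FlowMonitor`, the window size, and the linear extrapolation are illustrative choices, not part of any SPE's API.

```python
from collections import deque


class FlowMonitor:
    """Track per-interval tuple counts of a data stream and estimate
    its speed, acceleration, and expected next-interval volume.
    (Hypothetical sketch, not part of any SPE API.)"""

    def __init__(self, window=3):
        # tuple counts for the most recent `window` intervals
        self.counts = deque(maxlen=window)

    def observe(self, count):
        """Record the number of tuples that arrived in the last interval."""
        self.counts.append(count)

    def speed(self):
        """Average arrival rate (tuples per interval) over the window."""
        return sum(self.counts) / len(self.counts)

    def acceleration(self):
        """Change in arrival rate between the two most recent intervals."""
        if len(self.counts) < 2:
            return 0
        return self.counts[-1] - self.counts[-2]

    def expected_next(self):
        """Naive linear extrapolation of the next interval's volume."""
        return self.counts[-1] + self.acceleration()


# Example: a stream accelerating from 100 to 220 tuples per interval.
monitor = FlowMonitor(window=3)
for count in (100, 150, 220):
    monitor.observe(count)

print(monitor.speed())          # average rate over the window
print(monitor.acceleration())   # rate grew by 70 tuples/interval
print(monitor.expected_next())  # 290 tuples expected next interval
```

A proactive scaler could compare `expected_next()` against the currently provisioned capacity and request additional resources before the spike arrives, rather than reacting after queues have already built up.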
New auto-scaling techniques are, therefore, required that analyze the data flow characteristics and use the knowledge hidden within the data to make more reliable and adaptive scaling decisions.

In this position paper we present our vision of a novel framework to auto-scale cloud resources for streaming data in popular SPEs such as Apache Storm. Our vision explores two fundamental ideas. First, we examine streaming data flow characteristics such as speed and acceleration. For example, the speed of a stock market stream is directly affected by the occurrence of a major event such as a natural disaster. By analyzing the data stream and measuring its speed and acceleration, we can anticipate the upcoming stream volume, and further predict the required cloud resources.

Second, we explore the use of streaming machine learning techniques to classify workloads and predict future bottlenecks. Our motivation is the fact that the data flow characteristics of the streaming data can themselves be thought of as another streaming data source that can be analyzed using an SPE.

While auto-scaling of resources is not a new topic in cloud research, little of this work has targeted popular and practical SPEs such as Apache Storm. In this work we specifically aim to address this gap, using the data flow characteristics of streaming data and machine learning techniques to proactively predict the scaling of resources.

2. RELATED WORK
Auto-scaling of cloud resources is not a new topic in the literature. Efforts have been made to achieve auto-scaling