Scheduling Decisions in Stream Processing on Heterogeneous Clusters

Marek Rychlý
Department of Information Systems
Faculty of Information Technology
Brno University of Technology
Brno, Czech Republic
Email: rychly@fit.vutbr.cz

Petr Škoda, Pavel Smrž
Department of Computer Graphics and Multimedia
Faculty of Information Technology, Brno University of Technology
IT4Innovations Centre of Excellence
Brno, Czech Republic
Email: {iskoda,smrz}@fit.vutbr.cz

Abstract: Stream processing is a paradigm evolving in response to well-known limitations of the widely adopted MapReduce paradigm for big data processing, a hot topic of today's computer world. Moreover, in the field of computation facilities, heterogeneity of data processing clusters, intended or unintended, is becoming relatively common. This paper deals with scheduling problems and decisions in stream processing on heterogeneous clusters. It gives an overview of the current state of the art of stream processing on heterogeneous clusters with a focus on resource allocation and scheduling. Basic scheduling decisions are discussed and demonstrated on naive scheduling of a sample application. The paper presents a proposal of a novel scheduler for stream processing frameworks on heterogeneous clusters, which employs design-time knowledge as well as benchmarking techniques to achieve optimal resource-aware deployment of applications over the clusters and eventually better overall utilization of the cluster.

Keywords: scheduling; resource-awareness; benchmarking; heterogeneous clusters; stream processing; Apache Storm.

I. INTRODUCTION

As the Internet grows bigger, the amount of data that can be gathered, stored, and processed constantly increases. Traditional approaches to processing big data, e.g., the data of crawled documents, web request logs, etc., involve mainly batch processing techniques on very large shared clusters running in parallel across hundreds of commodity hardware nodes.
Given the static nature of such datasets, batch processing appears to be a suitable technique, both in terms of data distribution and task scheduling, and distributed batch processing frameworks, e.g., the frameworks that implement the MapReduce programming paradigm [1], have proved to be very popular.

However, the traditional approaches developed for the processing of static datasets cannot provide the low-latency responses needed for continuous and real-time stream processing, where new data is constantly arriving even as the old data is being processed. In the data stream model, some or all of the input data to be processed are not available in a static dataset, but rather arrive as one or more continuous data streams [2]. Traditional distributed processing frameworks like MapReduce are not well suited to processing data streams due to their batch orientation. The response times of those systems are typically greater than 30 seconds, while real-time processing requires response times in the (sub)second range [3].

To address distributed stream processing, several platforms for data or event stream processing have been proposed, e.g., S4 and Storm [4], [5]. In this paper, we build upon one of these distributed stream processing platforms, namely Storm. Storm defines distributed processing in terms of streams of data messages flowing from data sources (referred to as spouts) through a directed acyclic graph (referred to as a topology) of interconnected data processors (referred to as bolts). A single Storm topology consists of spouts that inject streams of data into the topology and bolts that process and modify the data.

In contrast to distributed batch processing, resource allocation and scheduling in distributed stream processing are much more difficult due to the dynamic nature of input data streams.
In both cases, resource allocation deals mainly with the problem of gathering and assigning resources to the different requesters, while scheduling decides which tasks to place on which previously obtained resources, and when [6]. In the case of distributed batch processing, both resource allocation and task scheduling can be done prior to the processing of a batch of jobs, based on knowledge of the data and tasks to be processed and of the distributed environment. Moreover, during batch processing, required resources are often simply allocated statically from the beginning to the end of the processing. In the case of distributed stream processing, which is typically continuous, the dynamic nature of input data and unlimited processing time require dynamic allocation of shared resources and real-time scheduling of tasks based on the actual intensity of the input data flow, the actual quality of the data, and the actual workload of the distributed environment. For example, resource allocation and task scheduling in Storm involve real-time decisions on how to replicate bolts and spread them across nodes of a cluster to achieve the required scalability and fault tolerance.

This paper deals with problems of scheduling in distributed data stream processing on heterogeneous clusters. The paper is organized as follows. In Section II, stream processing on heterogeneous clusters is discussed in detail, with a focus on resource allocation and task scheduling, and related work and existing approaches are analysed. In Section III, a use case of distributed stream processing is presented. Section IV deals with scheduling decisions in the use case. Based on the analysis of the scheduling decisions, Section V proposes a concept of a novel scheduling advisor for distributed stream processing on heterogeneous clusters. Since this paper presents ongoing research, Section VI discusses future work on the scheduling advisor. Finally, Section VII provides conclusions.
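
To make the terminology of this section concrete, the following Python sketch models a topology as a directed acyclic graph of spouts and bolts and spreads replicated bolt instances across cluster nodes in simple rotation. It is an illustration under stated assumptions only: the names `Topology`, `add_spout`, `add_bolt`, and `place_round_robin` are hypothetical and are not part of the Storm API, and the rotation-based placement is a deliberately naive stand-in that treats all nodes as identical.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Topology:
    """Hypothetical in-memory model of a topology (not Storm's API)."""
    spouts: set = field(default_factory=set)
    bolts: dict = field(default_factory=dict)  # bolt name -> replica count
    edges: list = field(default_factory=list)  # (source, target) stream links

    def add_spout(self, name):
        if name in self.spouts or name in self.bolts:
            raise ValueError(f"component already declared: {name}")
        self.spouts.add(name)

    def add_bolt(self, name, replicas, inputs):
        # A bolt may only read from components declared earlier and may not
        # be redeclared, so the graph stays acyclic by construction.
        if name in self.spouts or name in self.bolts:
            raise ValueError(f"component already declared: {name}")
        for source in inputs:
            if source not in self.spouts and source not in self.bolts:
                raise ValueError(f"unknown source component: {source}")
        self.bolts[name] = replicas
        self.edges.extend((source, name) for source in inputs)


def place_round_robin(topology, nodes):
    """Assign each bolt replica to the next node in turn.

    Assumption: every node is treated as equally capable, which is exactly
    what does not hold on a heterogeneous cluster.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    node_cycle = cycle(nodes)
    assignment = {}
    for bolt, replicas in topology.bolts.items():
        for index in range(replicas):
            assignment[(bolt, index)] = next(node_cycle)
    return assignment
```

A placement of this kind ignores both the differing capabilities of the nodes and the differing demands of the bolts, which is the gap that resource-aware scheduling is meant to close.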