Big Data Framework Interference In Restricted Private Cloud Settings

Stratos Dimopoulos, Chandra Krintz, Rich Wolski
Dept. of Computer Science, Univ. of California, Santa Barbara
{stratos, ckrintz, rich}@cs.ucsb.edu
Contact: stratos@cs.ucsb.edu

Abstract: In this paper, we characterize the behavior of “big” and “fast” data analysis frameworks in multi-tenant, shared settings in which computing resources (CPU and memory) are limited, an increasingly common scenario used to increase utilization and lower cost. We study how popular analytics frameworks behave and interfere with each other under such constraints. We empirically evaluate Hadoop, Spark, and Storm multi-tenant workloads managed by Mesos. Our results show that in constrained environments, there is significant performance interference that manifests as failed fair sharing, performance variability, and resource deadlock.

Keywords: Big Data Infrastructures and Frameworks, Private Cloud Multi-tenancy, Performance Interference

I. INTRODUCTION

Recent technological advances have spurred the production and collection of vast amounts of data about individuals, systems, and the environment. As a result, there is significant demand from software engineers, data scientists, and analysts with a variety of backgrounds and expertise for extracting actionable insights from this data. Such data has the potential to facilitate beneficial decision support for nearly every aspect of our society and economy, including social networking, health care, business operations, the automotive industry, agriculture, Information Technology, education, and many others.

To service this need, a number of open source technologies have emerged that make effective, large-scale data analytics accessible to the masses.
These include “big data” and “fast data” analysis systems such as Hadoop, Spark, and Storm from the Apache foundation, which analysts use to implement a variety of applications for query support, data mining, machine learning, real-time stream analysis, statistical analysis, and image processing [3, 13]. Because they are complex software systems with many installation, configuration, and tuning parameters, these frameworks are often deployed under the control of a distributed resource management system [16, 4] to decouple resource management from job scheduling and monitoring, and to facilitate resource sharing between multiple frameworks.

Each of these analytics frameworks tends to work best (e.g., scales best or has the lowest turnaround time) for different classes of applications, data sets, and data types. For this reason, users are increasingly tempted to use multiple frameworks, each implementing a different aspect of their analysis needs. This new form of multi-tenancy (i.e., multi-analytics) gives users the most choice in terms of extracting potential insights, enables them to fully utilize their compute resources and, when using public clouds, lets them manage their fee-for-use monetary costs.

Multi-analytics frameworks have also become part of the software infrastructure available in many private data centers and, as such, must function when deployed on a private cloud. With private clouds, resources are restricted by physical limitations. As a result, these technologies are commonly employed in shared settings in which more resources (CPU, memory, local disk) cannot simply be added on demand in exchange for an additional charge (as they can in a public cloud setting).

Because of this trend, we investigate and characterize the performance and behavior of big/fast data systems in shared (multi-tenant), moderately resource-constrained, private cloud settings.
While these technologies are typically designed for very large scale deployments such as those maintained by Google, Facebook, and Twitter, they are also common and useful at smaller scales [1, 11, 10].

We empirically evaluate the Hadoop, Spark, and Storm frameworks in combination, using Mesos [4] to mediate resource demands and to manage sharing across these big data tenants. Our goal is to understand how these frameworks interfere with each other in terms of performance when under resource pressure, and how Mesos behaves and achieves fair sharing [2] when demand for resources exceeds resource availability.

From our experiments and analyses, we find that even though Spark outperforms Hadoop when executed in isolation for a set of popular benchmarks, in a multi-tenant system their performance varies significantly depending on their respective scheduling policies and the timing of Mesos resource offers. Moreover, for some combinations of frameworks, Mesos is unable to provide fair sharing of resources and/or avoid deadlocks. In addition, we quantify the framework startup overhead and the degree to which it affects short-running jobs.

II. BACKGROUND

In private cloud settings, where users must contend for a fixed set of data center resources, users commonly run multiple analytics systems on the same resources to make the most of the limited capacity to which they have been granted access. To understand how these frameworks interfere in such settings, we investigate the use of Mesos to manage them and to facilitate fair sharing. Mesos is a cluster manager that can support a variety of distributed systems including Hadoop, Spark, Storm, Kafka, and others [4]. The goal of our work is to investigate the performance implications associated with Mesos management of multi-tenancy for medium- and small-scale data analytics on private clouds.
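
To make the notion of fair sharing under a fixed resource budget concrete, the following is a minimal sketch of a Dominant Resource Fairness (DRF) style allocation loop, assuming that this is the kind of policy meant by fair sharing here. It is not Mesos code and does not model resource offers that frameworks accept or decline. The function name `drf_allocate`, the framework names `fw_a` and `fw_b`, and all capacities and per-task demands are hypothetical values chosen for illustration.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Greedy DRF-style allocation over a fixed pool (illustrative only).

    capacity: total units per resource, e.g. {"cpu": 9, "mem_gb": 18}
    demands:  per-task demand of each framework for every resource;
              each task is assumed to need a positive amount of at
              least one resource, otherwise the loop would not end.
    Returns the number of tasks granted to each framework.
    """
    used = {r: 0 for r in capacity}
    tasks = {fw: 0 for fw in demands}
    active = list(demands)  # frameworks whose next task may still fit

    def dominant_share(fw):
        # Largest fraction of any single resource held by this framework.
        return max(
            Fraction(tasks[fw] * demands[fw][r], capacity[r]) for r in capacity
        )

    while active:
        # Serve the framework with the smallest dominant share next.
        fw = min(active, key=dominant_share)
        need = demands[fw]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[fw] += 1
        else:
            # Its next task no longer fits in the fixed pool.
            active.remove(fw)
    return tasks


# Hypothetical cluster and per-task demands (not taken from our experiments).
capacity = {"cpu": 9, "mem_gb": 18}
demands = {
    "fw_a": {"cpu": 1, "mem_gb": 4},  # memory-heavy tasks
    "fw_b": {"cpu": 3, "mem_gb": 1},  # CPU-heavy tasks
}
allocation = drf_allocate(capacity, demands)
print(allocation)
```

With these illustrative numbers the loop grants `fw_a` three tasks and `fw_b` two, at which point all nine CPUs are in use and neither framework can launch another task. Because the pool is fixed, as in a private cloud, the allocation ends when the next task does not fit rather than when more capacity is acquired.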