Exploring the Design Space for Optimizations with Apache Aurora and Mesos

Renan DelValle, Gourav Rattihalli, Angel Beltre, Madhusudhan Govindaraju, and Michael J. Lewis
Department of Computer Science, State University of New York (SUNY) at Binghamton
{rdelval1, grattih1, abeltre1, mgovinda, mlewis}@binghamton.edu

Abstract—Cloud infrastructures increasingly include a heterogeneous mix of components in terms of performance, power, and energy usage. As the size of cloud infrastructures grows, power consumption becomes a significant constraint. We use Apache Mesos and Apache Aurora, which provide massive scalability to web-scale applications, to demonstrate how a policy-driven approach that bin-packs workloads according to their power profiles, instead of using the default allocation by Mesos and Aurora, can effectively reduce peak-power and energy usage and improve node utilization when workloads are co-scheduled. Our experimental results show reductions of 11% in peak power and 86% in total energy usage, and increases in utilization of 148% for memory and 8% for CPU across the different policies.

I. INTRODUCTION

Large-scale datacenters (DCs) execute thousands of diverse applications each day. Conflicts between co-located workloads and the difficulty of matching applications to appropriate nodes can degrade performance and violate workload Quality of Service (QoS) requirements [1]. Apache Mesos [2] enables dynamic partitioning and removes the need to isolate frameworks on separate resources. Apache Aurora [3] works in concert with Mesos as a service scheduler. Many large companies with large-scale applications, such as Twitter, Apple, and Bloomberg, use these technologies to provide scalability and stability for their cloud infrastructures. Mesos and Aurora interact to allocate resources (memory, cores, and storage) to tasks and to cache these resource allocation decisions (described in Section II).
The caching mechanism is effective when each task is allocated in isolation, but can have negative consequences when many fine-grained jobs arrive and resource offers are too large. For example, if 10 tasks each requesting one CPU and four GB of memory arrive at Aurora, and Mesos makes a combined resource offer of ten CPUs and 40 GB of memory, Aurora accepts the offer. However, it accepts the offer of one CPU and four GB for each task one at a time, repeating the resource request and acceptance process for all tasks. While this negotiation phase provides fairness and enables co-scheduling of tasks, it may introduce a delay before the start of each task.

As cloud workloads run, they draw power on their host machine(s). The power usage can vary depending on the characteristics of each host and on the cloud's shared software infrastructure. Optimally co-scheduling applications to minimize peak-power usage can be reduced to a multi-dimensional bin-packing problem, and is therefore NP-hard. We therefore set out to find heuristics that reduce peak power in a cluster that uses Mesos and Aurora.

Approach. We address the peak-power collision problem using a policy-driven heuristic approach to the multi-dimensional bin-packing problem. We use the DaCapo benchmarks [4] as workloads. DaCapo is a set of open-source, real-world applications that exercise the various resources within a compute node. Our approach characterizes the power use of each benchmark on each node using fine-grained power profiles provided by Intel's Running Average Power Limit (RAPL) [5] counters via the Linux Powercapping framework [6]. We take the power profiling data for a given benchmark and node, and use it to engineer the job arrival time by delaying the job by up to 3 seconds. This delay ensures that the power surges of two benchmarks do not occur at the same instant, and also influences how Mesos and Aurora allocate resources for each benchmark.
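The staggering idea above can be sketched as a small search over candidate start delays. The function below is a minimal illustration, not the authors' implementation: the function name, the one-second profile granularity, and the representation of a power profile as a list of per-interval wattages are all assumptions made for the example.

```python
def stagger_delay(profile_a, profile_b, max_delay_s=3, step_s=1):
    """Pick a start delay for job B (0..max_delay_s seconds) that minimizes
    the combined peak power when co-scheduled with job A.

    Profiles are lists of average power draw (watts) per one-second interval.
    Returns (best_delay_seconds, resulting_peak_watts).
    """
    best_delay, best_peak = 0, float("inf")
    for d in range(0, max_delay_s + 1, step_s):
        shifted = [0.0] * d + profile_b              # job B starts d seconds later
        length = max(len(profile_a), len(shifted))
        a = profile_a + [0.0] * (length - len(profile_a))
        b = shifted + [0.0] * (length - len(shifted))
        peak = max(x + y for x, y in zip(a, b))      # combined draw per interval
        if peak < best_peak:
            best_delay, best_peak = d, peak
    return best_delay, best_peak
```

For two jobs whose power surge occurs in the first second of execution, the search pushes the second job's start past the first job's surge, so the surges no longer coincide.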
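The bin-packing side of the approach can likewise be illustrated with a greedy heuristic. The sketch below packs tasks onto nodes by per-task peak-power estimates under per-node power budgets using first-fit decreasing; it is a simplified stand-in for the policies evaluated in this paper, and the function name, data shapes, and budget model are assumptions for illustration only.

```python
def pack_by_power(tasks, nodes):
    """Greedy first-fit-decreasing packing of tasks onto nodes by power budget.

    tasks: list of (task_name, estimated_peak_watts)
    nodes: dict of node_name -> power budget in watts
    Returns a dict of node_name -> list of assigned task names.
    """
    headroom = dict(nodes)                     # remaining watts per node
    placement = {n: [] for n in nodes}
    # Place the largest power consumers first (first-fit decreasing).
    for name, watts in sorted(tasks, key=lambda t: -t[1]):
        for node in placement:
            if watts <= headroom[node]:
                placement[node].append(name)
                headroom[node] -= watts
                break
        else:
            raise ValueError(f"no node has {watts} W of headroom for {name}")
    return placement
```

A "local" policy in this sketch would consult only the headroom of the node under consideration, while a "global" policy could rank nodes by cluster-wide profile data before the first-fit scan.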
We show the effect of two different bin-packing policies: one that takes into account local power profile information, and one that takes into consideration global power profile data. We evaluate how the staging of tasks to avoid peak-power collisions also influences resource usage and energy consumption.

We make the following contributions in this paper:
• We demonstrate how bin-packing a set of tasks can effectively reduce peak-power usage and total energy usage while increasing node utilization, when workloads are co-scheduled to run using Mesos and Aurora.
• We show how our experimental framework can inform application developers how their applications respond to peak power usage in a heterogeneous cloud environment.
• We demonstrate how Apache Mesos and Apache Aurora should be used so that application developers can express what they need from a cluster in terms of peak power use, and not just memory, disk, and CPU specifications.

II. BACKGROUND: MESOS AND AURORA

Mesos provides scalability and fault tolerance to massive-scale applications. Examples of its use include Apple's Siri, Bloomberg's data analytics, PayPal's continuous integration system, and Verizon Labs [7].

How Apache Mesos Works: Apache Mesos provides a layer of abstraction above the compute resources in data centers and large clusters. Mesos combines cluster resources (CPU, memory, and storage) into a shared pool, and efficiently