ICE: An Integrated Configuration Engine for Interference Mitigation in Cloud Services

Amiya K. Maji, Subrata Mitra, Saurabh Bagchi
Purdue University, West Lafayette, IN
Email: {amaji, mitra4, sbagchi}@purdue.edu

Abstract—Performance degradation due to imperfect isolation of hardware resources such as cache, network, and I/O is a frequent occurrence in public cloud platforms. A web server that suffers from performance interference degrades the interactive user experience and causes lost revenue. Existing work on interference mitigation tries to address this problem through intrusive changes to the hypervisor, e.g., using intelligent schedulers or live migration, many of which are available only to infrastructure providers and not to end consumers. In this paper, we present a framework for administering web server clusters in which the effects of interference can be reduced by intelligent reconfiguration. Our controller, ICE, improves web server performance during interference by performing two-fold autonomous reconfigurations. First, it reconfigures the load balancer at the ingress point of the server cluster, thereby reducing load on the impacted server. ICE then reconfigures the middleware at the impacted server to reduce its load even further. We implement and evaluate ICE on CloudSuite, a popular web application benchmark, and with two popular load balancers, HAProxy and LVS. Our experiments in a private cloud testbed show that ICE can improve the median response time of web servers by up to 94% compared to a statically configured server cluster. ICE also outperforms an adaptive load balancer (using least-connection scheduling) by up to 39%.

Keywords—interference; cloud performance; load-balancer; dynamic configuration

I. INTRODUCTION

Performance issues in web service applications are notoriously hard to detect and debug. In many cases, these performance issues arise from incorrect configurations or incorrect programs.
Web servers running in virtualized environments also suffer from issues that are specific to the cloud, such as interference [1], [2] or incorrect resource provisioning [3]. Among these, performance interference and its more visible counterpart, performance variability, cause significant concern among IT administrators [4]. Interference also poses a significant threat to the usability of Internet-enabled devices that rely on hard latency bounds on server response (imagine the suspense if Siri took minutes to answer your questions!). Existing research shows that interference is a frequent occurrence in large-scale data centers [1], [5]. Therefore, web services hosted in the cloud must be aware of such issues and adapt when needed.

Interference happens because of the sharing of low-level hardware resources such as cache, memory bandwidth, and network. Partitioning these resources is practically infeasible without incurring high overhead (in terms of compute, memory, or even reduced utilization). Existing solutions primarily try to solve the problem from the point of view of a cloud operator. The core techniques used by these solutions include a combination of one or more of the following: a) scheduling, b) live migration, c) resource containment. Research on novel scheduling policies looks at the problem at two abstraction levels. Cluster schedulers (consolidation managers) try to optimally place VMs on physical machines such that there is minimal resource contention among VMs on the same physical machine [6]. Novel hypervisor schedulers [7] try to schedule VM threads so that only non-contending threads run in parallel. Live migration involves moving a VM from a busy physical machine to a free machine when interference is detected [8]. Resource containment is generally applicable to containers such as LXC, where the CPU cycles allocated to batch jobs are reduced during interference [1], [9].
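As a concrete illustration of the resource-containment technique (a sketch of the general mechanism, not of any cited system; the cgroup name "batch" and the values are assumptions), the CFS bandwidth controller underlying LXC can cap the CPU share of batch jobs:

```
# Sketch: throttle the "batch" cgroup to 50% of one CPU using the
# cgroup-v1 CFS bandwidth controller (illustrative values).
echo 100000 > /sys/fs/cgroup/cpu/batch/cpu.cfs_period_us   # 100 ms period
echo 50000  > /sys/fs/cgroup/cpu/batch/cpu.cfs_quota_us    # 50 ms quota = 50%
```

Tightening the quota during interference leaves more CPU headroom for the latency-sensitive VM or container on the same host.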
Note that all these approaches require access to the hypervisor (or the kernel, in the case of LXC), which is beyond the reach of a cloud consumer. Prior work indicates that, despite having (arguably the best of) schedulers, the public cloud service Amazon Web Services shows a significant amount of interference [2]. We therefore need to find practical solutions that do not require modification of the hypervisor. One existing solution that looks at this problem from a cloud consumer's point of view is IC2 [2]. IC2 mitigates interference by reconfiguring web server parameters in the presence of interference. The parameters considered are MaxClients (MXC) and KeepaliveTimeout (KAT) in Apache, and pm.max_children (PHP) in the Php-fpm engine. The authors showed that they could recapture lost response time by up to 30% in Amazon's EC2. However, the drawback of this approach is that web server reconfiguration usually has high overhead (the web server needs to spawn or kill some of its worker threads). Moreover, IC2 alone cannot lower response time much further without drastically degrading throughput.

We note that the key goal of IC2 is to reduce the load on the web server (WS) during periods of interference. We also observe that implementing admission control at the gateway (load balancer) is a direct way of reducing load to the affected web server. An out-of-the-box load balancer is agnostic of interference and therefore treats all web servers equally. We aspire to make the load balancer interference-aware, and evaluate how much performance gain can be achieved.

Solution Approach. The proposed solution, called ICE,
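The load-balancer side of this idea can be sketched with HAProxy, one of the two load balancers evaluated here. In a hypothetical two-server backend (names and addresses are assumptions, not from the evaluation), shifting load away from an interference-affected server amounts to lowering its weight, either in the configuration file or at runtime via the stats socket:

```
# haproxy.cfg sketch: ws2 is suffering interference, so its weight is
# lowered; HAProxy then routes it proportionally fewer requests.
backend web_cluster
    balance roundrobin
    server ws1 10.0.0.1:80 weight 100
    server ws2 10.0.0.2:80 weight 25   # impacted server gets ~1/5 the load

# Runtime alternative (no restart needed), via the HAProxy stats socket:
#   echo "set weight web_cluster/ws2 25" | socat stdio /var/run/haproxy.sock
```

An interference-aware controller such as ICE would issue such weight adjustments automatically when interference is detected, rather than relying on a statically configured weight.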