Leveraging Approximation to Improve Resource Efficiency in the Cloud

Neeraj Kulkarni, Feng Qi, Glyfina Fernando, and Christina Delimitrou
Cornell University
{nsf49, fq26, gsf52, delimitrou}@cornell.edu

Abstract

Although cloud computing has increased in popularity, datacenter utilization has remained low for the most part. This is in part due to the interference that results from applications sharing hardware and software resources. When interference occurs, the resources of at least one co-scheduled application must be reduced, forcing it to incur a performance penalty. In current proposals, the penalized application is typically a low-priority, best-effort workload. Approximate computing applications present an opportunity to improve datacenter efficiency without performance degradation, since they can absorb the enforced resource reduction as a loss in output quality.

In this paper we present Pliant, a runtime system that improves datacenter utilization by co-scheduling interactive services with approximate computing applications. When the runtime detects QoS violations in the interactive service, it employs approximation to reduce interference, and absorbs the resource reduction as a loss in output accuracy.

1. Introduction

Cloud computing has reached proliferation by offering resource flexibility and cost efficiency [2, 1, 8]. Cost efficiency is achieved through multi-tenancy, i.e., by co-scheduling multiple jobs on the same physical platform. Unfortunately, multi-tenancy also leads to unpredictable performance due to interference [12, 4, 13]. When the applications suffering from interference are high-priority, interactive services, like websearch and social networking, multi-tenancy is either disallowed, hurting utilization, or - at best - interactive services are co-scheduled with low-priority, best-effort applications whose performance can be sacrificed [16, 9, 5].
Approximate computing applications offer the potential to break this utilization versus performance trade-off.

In this work we present Pliant, a cloud runtime system that achieves both high quality of service (QoS) and high utilization by leveraging the ability of approximate computing applications to tolerate some loss of output quality. Pliant enables aggressive co-scheduling of interactive, latency-critical services with - also high-priority - approximate computing applications. It consists of a lightweight performance monitoring system based on adaptive sampling [15, 10] that continuously checks for QoS violations, and a dynamic compilation system that adjusts, in an online manner, the level of accuracy the approximate computing application can sustain. When interference surfaces due to resource sharing, Pliant employs suitable approximation techniques that alleviate contention without penalizing the execution time of the approximate computing application. Specifically, Pliant determines the appropriate approximation technique(s) based on the type of interference measured in the system, e.g., memory, compute, network, or storage I/O, and incrementally increases the degree of approximation until the interactive service can once again meet its QoS constraints.

We evaluate Pliant with a distributed in-memory, low-latency caching service, memcached [7], and two benchmark suites with applications that can tolerate approximation [3, 17]. On server platforms with 20 physical (40 logical) cores, Pliant enables 90-95% CPU utilization, while ensuring that memcached achieves the same throughput (QPS) and tail latency as when run in isolation, and the approximate computing applications achieve 18.3% lower execution time on average, with a maximum of 15% loss in accuracy.

2. Pliant Design & Evaluation

Pliant consists of three components.
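Before detailing each component, the control loop sketched in the Introduction can be illustrated as follows. This is a minimal sketch, not Pliant's actual API: the function names, the resource-to-technique table, and the degree cap are all assumptions made for illustration.

```python
# Illustrative sketch of the interference-driven control loop: pick an
# approximation technique matching the contended resource, then raise
# the degree of approximation until the interactive service meets its
# QoS target again. All names here are hypothetical.

MAX_DEGREE = 4  # assumed cap on how aggressive approximation can get

# Assumed pairing of contended resource to approximation technique.
TECHNIQUE_FOR = {
    "memory": "reduced_precision",
    "compute": "loop_perforation",
    "network": "lossy_transfer",
    "storage": "sampled_io",
}

def adjust_approximation(contended_resource, qos_violated):
    """Incrementally raise the approximation degree while the
    interactive service's QoS is still violated; backing off when
    interference subsides is omitted for brevity."""
    technique = TECHNIQUE_FOR[contended_resource]
    degree = 0
    while degree < MAX_DEGREE and qos_violated(degree):
        degree += 1  # apply a more aggressive version of `technique`
    return technique, degree
```

The key design point this sketch captures is that the technique is chosen from the *type* of interference (memory, compute, network, storage I/O), while the *degree* is raised incrementally rather than jumping straight to maximum approximation.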
First, a lightweight performance monitor continuously samples the throughput and end-to-end latency (average and tail) of the interactive service, and notifies the runtime system in the event of a QoS violation. Second, an interference monitor runs on the server and collects performance counter information that identifies the resource(s) suffering from contention. Third, a runtime system enforces a degree and method of approximation based on the output of the performance and interference monitors. The system uses DynamoRIO [6] to switch between the precise and different approximate versions of the approximate computing applications.

Figure 1 shows an overview of the runtime system. The interactive service shares physical cores with the approximate computing applications, although an individual hyperthread (or vCPU) is dedicated to a single application, which is common practice in public clouds [1]. A separate client machine is used to drive the load of the interactive service.

Performance monitor: This module is integrated in our workload generator and runs on the client side to capture, in addition to processing time, the network latency of a request's round trip. It relies on adaptive sampling of requests (based on request rate) to keep monitoring overheads negligible (< 0.01% in throughput and < 0.1% in 99th percentile latency), and leverages SystemTap to provide a breakdown of execution time and identify latency bottlenecks.

Dynamic recompilation: Pliant relies on DynamoRIO [6] to switch between the precise and different approximate versions of an application; the trigger for a switch is one of several Linux signals (e.g., SIGTERM, SIGQUIT). We examine the following approximation techniques:
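As a side note on the switching mechanism described above, the signal-triggered transition between precise and approximate versions can be sketched in miniature. Pliant performs the actual code swap through DynamoRIO's dynamic recompilation; the flag-based dispatch, the choice of SIGUSR1, and the `reduce` kernel below are illustrative assumptions only.

```python
# Minimal sketch of signal-triggered version switching: the runtime
# sends a Linux signal to the approximate application, whose handler
# flips a flag selecting the precise or approximate code path.
# (Pliant swaps code via DynamoRIO; this flag is only illustrative.)
import os
import signal

approximate_mode = False  # start with the precise version

def on_switch(signum, frame):
    """Toggle between precise and approximate execution."""
    global approximate_mode
    approximate_mode = not approximate_mode

signal.signal(signal.SIGUSR1, on_switch)  # SIGUSR1 chosen for the sketch

def reduce(data):
    if approximate_mode:
        # Loop perforation: visit every other element and rescale.
        return sum(data[::2]) * 2
    return sum(data)  # precise version

precise_result = reduce([1, 2, 3, 4])  # precise sum: 10
os.kill(os.getpid(), signal.SIGUSR1)   # runtime requests a switch
approx_result = reduce([1, 2, 3, 4])   # perforated estimate: 8
```

A flag check per call is cheap but still a per-invocation cost; rewriting the code path itself, as dynamic recompilation does, removes even that overhead from the hot loop.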