On-The-Fly Capacity Planning

Nick Mitchell    Peter F. Sweeney
IBM T.J. Watson Research Center
{nickm,pfs}@us.ibm.com

Abstract

When resolving performance problems, a simple histogram of hot call stacks does not cut it, especially given the highly fluid nature of modern deployments. Why bother tuning, when adding a few CPUs via the management console will quickly resolve the problem? The findings of these tools are also presented without any sense of context: e.g. string conversion may be expensive, but it only matters if it contributes greatly to the response time of user logins. Historically, these concerns have been the purview of capacity planning. The power of planners lies in their ability to weigh demand versus capacity, and to do so in terms of the important units of work in the application (such as user logins). Unfortunately, they rely on measurements of rates and latencies, and both quantities are difficult to obtain. Even when these measurements are possible, when all is said and done, these planners only relate to the code as a black box: but why bother adding CPUs, when easy code changes will fix the problem? We present a way to do planning on-the-fly: with a few call stack samples taken from an already-running system, we predict the benefit of a proposed tuning plan. We accomplish this by simulating the effect of a tuning action upon execution speed and the way it shifts resource demand. To identify existing problems, we show how to generate tuning actions automatically, guided by the desire to maximize speedup without needless expense, and that these generated plans may span resource and code changes. We show that it is possible to infer everything needed from these samples alone: levels of resource demand and the units of work in the application. We evaluate our planner on a suite of microbenchmarks and on a suite of 15,000 data sets that come from real applications running in the wild.
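As a rough illustration of this kind of simulation (a toy sketch with invented names, not the system described in this paper): treat each stack sample as one unit of demand on the resource the sampled thread was using or waiting for, approximate completion time by the most heavily loaded resource, and model a tuning action as a shift of demand from one resource to another.

```python
# A minimal sketch, assuming a bottleneck model of completion time:
# time is set by the busiest resource, demand / capacity. All function
# and variable names here are our own illustration.
from collections import Counter

def completion_time(demand, capacity):
    """Bottleneck approximation: time is governed by the busiest resource."""
    return max(demand[r] / capacity[r] for r in demand if demand[r] > 0)

def predicted_speedup(samples, capacity, shift_from, shift_to):
    """Predict the speedup of a tuning action that moves all demand from
    one resource to another (e.g. eliminating a lock pushes the waiting
    threads' work onto the CPU)."""
    before = Counter(samples)
    after = before.copy()
    moved = after.pop(shift_from, 0)
    after[shift_to] += moved
    return completion_time(before, capacity) / completion_time(after, capacity)

# 60 samples stuck on a lock, 80 on a 4-way CPU: removing the lock helps,
# because the CPU has headroom (80/4 = 20 time units vs. 60 on the lock).
samples = ["lock"] * 60 + ["cpu"] * 80
print(predicted_speedup(samples, {"cpu": 4, "lock": 1}, "lock", "cpu"))
```

With 200 CPU samples instead of 80, the same call returns a ratio below 1: the shifted demand lands on an already-saturated CPU, and even this crude model flags the latent bottleneck discussed in the introduction.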
OOPSLA ’13, October 29–31, 2013, Indianapolis, Indiana, USA.
Copyright © 2013 ACM 978-1-4503-2374-1/13/10…$15.00.
http://dx.doi.org/10.1145/nnnnnnn.nnnnnnn

1. Introduction

With a bit of planning, the arduous task of optimization can lead to large improvements in maintenance costs and performance. We have seen many situations where a few straightforward changes were enough to tip an application from the slow death of severe resource contention to the smooth flow of work through the system. Adding processors to a database machine, splitting the execution of a single-process program into several processes, or updating code to use a concurrent data structure — these changes can yield big reductions in the resources necessary to support anticipated load at desired costs. This paper presents a system that guides the selection and parametrization of tuning actions.

Teams need guidance because, with every turn of the tuning crank, they are again faced with reasoning through how their changes altered the landscape of resource constraints. Did this latest change remove the lock contention bottleneck? If so, then why is performance still bad? Did all of our tuning efforts simply expose another, heretofore latent, bottleneck somewhere else?

Latent Bottlenecks, Zero-sum Games, and Head Fakes

System tuning is often cursed with a richness of possibilities.
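To make one of the tuning actions named above concrete — updating code to use a concurrent data structure — here is a toy sketch (ours, not the paper's): a counter whose single guarding lock is replaced by a striped set of locks, so that threads touching different keys rarely contend.

```python
# Illustrative only; all names are our own. Striping replaces one
# hot guarding monitor with many, reducing demand on any single lock.
import threading

class StripedCounter:
    """Per-key counters protected by N locks instead of one."""
    def __init__(self, nshards=16):
        self.shards = [({}, threading.Lock()) for _ in range(nshards)]

    def _shard(self, key):
        return self.shards[hash(key) % len(self.shards)]

    def add(self, key, n=1):
        counts, lock = self._shard(key)
        with lock:  # only keys hashing to the same shard contend
            counts[key] = counts.get(key, 0) + n

    def get(self, key):
        counts, lock = self._shard(key)
        with lock:
            return counts.get(key, 0)

counter = StripedCounter()
threads = [threading.Thread(target=lambda: [counter.add("hits") for _ in range(1000)])
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.get("hits"))  # 8000: no updates lost
```

As the introduction goes on to argue, whether such a change actually helps depends on where the freed-up demand lands.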
The best course of action is often not obvious, because the most prevalent activity may not be what you need to fix. For example, if demand for a critical section is high, hundreds of threads may sit idle, waiting for access to the guarding monitor. If the machine’s CPU is already saturated, then eliminating the lock, at least as an isolated tuning action, may be a futile endeavor. Doing so will only shift resource demand away from the lock and onto an already-saturated CPU. The saturated CPU resource is a latent bottleneck, at least with respect to the prima facie problem, that of hundreds of threads backed up on a lock.

This scenario is also an example of a zero-sum game: those threads already consuming CPU will execute more slowly, due to increased contention for processors, and those that were previously waiting on the lock can now complete more quickly — in equal measure.

Even in the absence of these confounding problems, other kinds of “head fakes” can occur. For example, consider a case with two program tasks A and B competing for a saturated pool of processors. If you are dissatisfied with