Finding a “Kneedle” in a Haystack: Detecting Knee Points in System Behavior

Ville Satopää†, Jeannie Albrecht†, David Irwin‡, and Barath Raghavan§
†Williams College, Williamstown, MA
‡University of Massachusetts Amherst, Amherst, MA
§International Computer Science Institute, Berkeley, CA

Abstract: Computer systems often reach a point at which the relative cost to increase some tunable parameter is no longer worth the corresponding performance benefit. These “knees” typically represent beneficial points that system designers have long selected to best balance inherent trade-offs. While prior work largely uses ad hoc, system-specific approaches to detect knees, we present Kneedle, a general approach to online and offline knee detection that is applicable to a wide range of systems. We define a knee formally for continuous functions using the mathematical concept of curvature and compare our definition against alternatives. We then evaluate Kneedle’s accuracy against existing algorithms on both synthetic and real data sets, and evaluate its performance in two different applications.

I. INTRODUCTION

Selecting the “right” operating point for a given system is often thought of as an art form, since the direct and indirect costs and benefits of changing different system parameters are difficult or even impossible to quantify. For example, an important operating point in a large MapReduce job occurs when the job should no longer wait for “slow” tasks to finish, but instead speculatively re-execute work on other nodes in hopes of finishing the job sooner [1]. Since MapReduce’s goal is to finish all tasks as fast as possible, it must decide when the cost, in terms of a job’s running time and cluster utilization, is worth the corresponding performance benefit, in terms of task completion percentage.
Congestion-responsive network protocols face a related challenge when setting a sending rate: a protocol must decide a rate that maximizes performance without exceeding its fair share and causing congestion.

In prior work, the issue has frequently been couched as identifying one or more “knees”: operating points, based on recent trends, where the perceived cost to alter a system parameter is no longer worth the expected performance benefit. For MapReduce, triggering speculative execution after observing a knee in the task completion percentage ensures that the system re-executes tasks that are significantly slower than other similar tasks that have finished execution. In the case of a network protocol, successive increases to the sending rate should cease if delay signals congestion by increasing steeply, forming a knee. However, while the problem of knee detection (finding “good” operating points in system behavior) seems straightforward, to the best of our knowledge there exists neither an accepted definition of a knee nor a general systematic approach for detecting one.

Numerous researchers in widely disparate areas frequently encounter knee detection problems similar to those we describe [1], [2], [3], [4], [5]. In these systems, researchers either use ad hoc or system-specific approaches to detect knees, or defer the problem to future work. While a finely-crafted system-specific approach will perform better than a general knee detection approach, a designer may not take the time to design one. Thus, our aim is not to improve or optimize a specific system or protocol, but to provide system designers a general tool for improving the parts of their system they generally do not take the time to optimize.

In network protocol and system design, rules-of-thumb often serve researchers and operators well in the absence of an optimal solution. We believe that a tool for knee detection adds to their problem-solving arsenal.
Our hypothesis is that a knee detection algorithm that does not require tuning for a specific system or operational characteristics is applicable in a wide range of settings where developers do not take the time to design, test, and optimize a system-specific algorithm.

II. DEFINING AND DETECTING KNEES

While the notion of a knee is well known, we are not aware of a broadly accepted definition in prior literature. The confusion stems from the fact that researchers, in many cases unknowingly, use knees as a substitute for a more comprehensive cost-benefit analysis that is either difficult or impossible to perform. Performing a direct cost-benefit analysis is often complex, since it is inherently system-, platform-, and workload-specific. Further, many systems are not predictable due to volatile operating conditions. For example, unpredictable failure rates in large clusters, which may change over time, are the root cause of stragglers in MapReduce jobs [1]. Likewise, since multiple flows share network links in the Internet, network protocols cannot predict in advance the rapidly changing level of TCP-friendly bandwidth available, but must instead continuously adapt to the indirect signals of packet loss and delay [6].

In lieu of a complex system-specific analysis, operators tend to select operating points, or knees, that are “good enough” by observing where performance improvements start to level off as a function of one or more tunable system parameters. Note that we focus on knee detection for complex systems that change their behavior according to volatile, and potentially unpredictable, operating conditions, and not for simple systems that permit standard closed-form models, e.g., M/M/1 queues [7].
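To make the informal “leveling off” rule of thumb concrete, the following minimal sketch picks the first sample after which the marginal gain falls below a fixed fraction of the largest gain seen so far. It is an illustration of the ad hoc practice described above, not the Kneedle algorithm; the function name `first_leveling_off`, the `fraction` threshold, and the sample data are all assumptions introduced here.

```python
from typing import Optional, Sequence


def first_leveling_off(
    xs: Sequence[float],
    ys: Sequence[float],
    fraction: float = 0.1,
) -> Optional[float]:
    """Return the first x after which the marginal gain drops below
    `fraction` of the largest marginal gain seen so far, or None.

    Hypothetical helper for illustration only. `fraction` is an arbitrary,
    hand-picked threshold (an assumption, not a value from the paper).
    """
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three (x, y) samples of equal length")

    best_slope = 0.0
    for i in range(1, len(xs)):
        dx = xs[i] - xs[i - 1]
        if dx <= 0:
            raise ValueError("xs must be strictly increasing")
        slope = (ys[i] - ys[i - 1]) / dx  # marginal benefit per unit of x
        best_slope = max(best_slope, slope)
        if best_slope > 0 and slope < fraction * best_slope:
            return xs[i - 1]  # gains have leveled off after this sample
    return None


# Made-up measurements: a tunable parameter vs. observed performance.
params = [1, 2, 3, 4, 5, 6]
perf = [10, 50, 80, 90, 92, 93]
print(first_leveling_off(params, perf))  # prints 4
```

The hand-picked `fraction` is the weak point of such a heuristic: a value that works for one workload may fire too early or too late on another, which is the kind of system-specific tuning a general knee detector aims to avoid.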