Finding a “Kneedle” in a Haystack: Detecting Knee Points in System Behavior

Ville Satopää†, Jeannie Albrecht†, David Irwin‡, and Barath Raghavan§
†Williams College, Williamstown, MA
‡University of Massachusetts Amherst, Amherst, MA
§International Computer Science Institute, Berkeley, CA

Abstract: Computer systems often reach a point at which the relative cost to increase some tunable parameter is no longer worth the corresponding performance benefit. These “knees” typically represent beneficial points that system designers have long selected to best balance inherent trade-offs. While prior work largely uses ad hoc, system-specific approaches to detect knees, we present Kneedle, a general approach to online and offline knee detection that is applicable to a wide range of systems. We define a knee formally for continuous functions using the mathematical concept of curvature and compare our definition against alternatives. We then evaluate Kneedle’s accuracy against existing algorithms on both synthetic and real data sets, and evaluate its performance in two different applications.

I. INTRODUCTION

Selecting the “right” operating point for a given system is often thought of as an art form, since the direct and indirect costs and benefits of changing different system parameters are difficult or even impossible to quantify. For example, an important operating point in a large MapReduce job occurs when the job should no longer wait for “slow” tasks to finish, but instead speculatively re-execute work on other nodes in hopes of finishing the job sooner [1]. Since MapReduce’s goal is to finish all tasks as fast as possible, it must decide when the cost, in terms of a job’s running time and cluster utilization, is worth the corresponding performance benefit, in terms of task completion percentage.
Congestion-responsive network protocols face a related challenge when setting a sending rate: a protocol must choose a rate that maximizes performance without exceeding its fair share and causing congestion.

In prior work, the issue has frequently been couched as identifying one or more “knees”: operating points, based on recent trends, where the perceived cost of altering a system parameter is no longer worth the expected performance benefit. For MapReduce, triggering speculative execution after observing a knee in the task completion percentage ensures that the system re-executes tasks that are significantly slower than other similar tasks that have finished execution. In the case of a network protocol, successive increases to the sending rate should cease if delay signals congestion by increasing steeply, forming a knee. However, while the problem of knee detection (finding “good” operating points in system behavior) seems straightforward, to the best of our knowledge there exists neither an accepted definition of a knee nor a general, systematic approach for detecting one.

Numerous researchers in widely disparate areas frequently encounter knee detection problems similar to those we describe [1], [2], [3], [4], [5]. In these systems, researchers either use ad hoc or system-specific approaches to detect knees or defer the problem to future work. While a finely crafted system-specific approach will perform better than a general knee detection approach, a designer may not take the time to design one. Thus, our aim is not to improve or optimize a specific system or protocol, but to provide system designers with a general tool for improving the parts of their system they generally do not take the time to optimize.

In network protocol and system design, rules of thumb often serve researchers and operators well in the absence of an optimal solution. We believe that a tool for knee detection adds to their problem-solving arsenal.
Our hypothesis is that a knee detection algorithm that does not require tuning for a specific system or operational characteristics is applicable in a wide range of settings where developers do not take the time to design, test, and optimize a system-specific algorithm.

II. DEFINING AND DETECTING KNEES

While the notion of a knee is well known, we are not aware of a broadly accepted definition in prior literature. The confusion stems from the fact that researchers, in many cases unknowingly, use knees as a substitute for a more comprehensive cost-benefit analysis that is either difficult or impossible to perform. Performing a direct cost-benefit analysis is often complex, since it is inherently system-, platform-, and workload-specific. Further, many systems are unpredictable because of volatile operating conditions. For example, unpredictable failure rates in large clusters, which may change over time, are the root cause of stragglers in MapReduce jobs [1]. Likewise, since multiple flows share network links in the Internet, network protocols cannot predict in advance the rapidly changing level of TCP-friendly bandwidth available, but must instead continuously adapt to the indirect signals of packet loss and delay [6].

In lieu of a complex system-specific analysis, operators tend to select operating points, or knees, that are “good enough” by observing where performance improvements start to level off as a function of one or more tunable system parameters. Note that we focus on knee detection for complex systems that change their behavior according to volatile, and potentially unpredictable, operating conditions, and not for simple systems that permit standard closed-form models, e.g., M/M/1 queues [7].