CADET: Debugging and Fixing Misconfigurations using Counterfactual Reasoning

Md Shahriar Iqbal ∗, University of South Carolina, miqbal@email.sc.edu
Rahul Krishna ∗, Columbia University, rahul.krishna@columbia.edu
Mohammad Ali Javidian, Purdue University, mjavidia@purdue.edu
Baishakhi Ray, Columbia University, rayb@cs.columbia.edu
Pooyan Jamshidi, University of South Carolina, pjamshid@cse.sc.edu

Abstract
Modern computing platforms are highly configurable, with thousands of interacting configuration options. However, configuring these systems is challenging, and misconfigurations can cause unexpected non-functional faults. This paper proposes CADET (short for Causal Debugging Toolkit), which enables users to identify, explain, and fix the root cause of non-functional faults early and in a principled fashion. CADET builds a causal model by observing the performance of the system under different configurations. Then, it uses causal path extraction followed by counterfactual reasoning over the causal model to (a) identify the root causes of non-functional faults, (b) estimate the effects of various configuration options on the performance objective(s), and (c) prescribe candidate repairs to the relevant configuration options to fix the non-functional fault. We evaluated CADET on 5 highly configurable systems by comparing it with state-of-the-art configuration optimization and ML-based debugging approaches. The experimental results indicate that CADET can find effective repairs for faults in multiple non-functional properties with up to 13% more accuracy, 32% higher gain, and 13× speed-up than other ML-based performance debugging methods. Compared to multi-objective optimization approaches, CADET can find fixes up to 8× faster with comparable or better performance gain. Our study of non-functional faults reported in NVIDIA’s forum shows that CADET can find 14% better repairs than the experts’ advice in less than 30 minutes.
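The abstract's case for a causal model over a purely correlational one can be illustrated with a small synthetic example. The sketch below is not CADET and uses no data from the paper: the variable names (`gpu_growth`, `swap_gb`, `latency`) and every coefficient are assumptions chosen only to show how a confounding option can make a pooled trend point the opposite way from the trend within each confounder level.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n_per_level=200, seed=0):
    """Generate synthetic (swap_gb, gpu_growth, latency) rows.

    Assumed (hypothetical) data-generating process: swap memory raises
    both GPU growth and latency, while GPU growth itself lowers latency.
    """
    rng = random.Random(seed)
    rows = []
    for swap_gb in (0.5, 1.0, 2.0, 3.0, 4.0):
        for _ in range(n_per_level):
            gpu_growth = 0.15 * swap_gb + rng.uniform(0.0, 0.3)
            latency = 10.0 * swap_gb - 5.0 * gpu_growth + rng.gauss(0.0, 0.2)
            rows.append((swap_gb, gpu_growth, latency))
    return rows


rows = simulate()

# Pooled trend: ignores swap memory, so the confounder leaks into the slope.
pooled_slope = ols_slope([r[1] for r in rows], [r[2] for r in rows])

# Stratified trend: one slope per swap-memory level.
per_level_slope = {}
for swap_gb in sorted({r[0] for r in rows}):
    group = [r for r in rows if r[0] == swap_gb]
    per_level_slope[swap_gb] = ols_slope(
        [r[1] for r in group], [r[2] for r in group]
    )

print(f"pooled slope: {pooled_slope:+.2f}")
for swap_gb, slope in per_level_slope.items():
    print(f"swap = {swap_gb} GB, slope: {slope:+.2f}")
```

With these assumed coefficients, the pooled slope comes out positive while every per-level slope is negative. A model fitted only to the pooled data would therefore recommend the wrong direction of change for `gpu_growth`.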
1 Introduction
Modern computing systems are highly configurable and can seamlessly be deployed on various hardware platforms and under different environmental settings. The configuration space is combinatorially large, with hundreds if not thousands of software and hardware configuration options that interact non-trivially with one another [38, 49, 99]. Unfortunately, configuring these systems to achieve specific goals is challenging and error-prone.

∗ Joint First Author

[Figure 1, panels (a), (b), (c): latency plotted against GPU growth, with swap memory levels of 512 Mb, 1Gb, 2Gb, 3Gb, and 4Gb; panel labels include GPU Memory, Swap Memory, Resource Pressure, and Latency.]
Figure 1: Observational data (in Fig. 1a) (incorrectly) shows that high GPU memory growth leads to high latency. The trend is reversed when the data is segregated by swap memory.

Incorrect configuration (misconfiguration) elicits unexpected interactions between software and hardware, resulting in non-functional faults, i.e., faults in non-functional system properties such as latency and energy consumption. These non-functional faults, unlike regular software bugs, do not cause the system to crash or exhibit an obvious misbehavior [75, 82, 94]. Instead, misconfigured systems remain operational while being compromised, resulting in severe performance degradation in latency, energy consumption, and/or heat dissipation [15, 71, 74, 84]. The sheer number of modalities of software deployment is so large that exhaustively testing every conceivable software and hardware configuration is impossible. Consequently, identifying the root cause of non-functional faults is notoriously difficult [35], with as much as 99% of them going unnoticed or unreported for extended durations [4]. This has tremendous monetary repercussions, costing companies worldwide an estimated $5 trillion in 2018 and 2019 [34]. Further, developers on online forums are quite vocal in expressing their dissatisfaction.
For example, one developer on NVIDIA’s developer forum bemoans: “I am quite upset with CPU usage on TX2 [8],” while another complained, “I don’t think it [the performance] is normal and it gets more and more frustrating [7].” Crucially, these exchanges provoke other unanswered questions, such as, “what would be the effect of changing another configuration ‘X’? [2].” Therefore, we seek methods that can identify, explain, and fix the root cause of non-functional faults early and in a principled fashion.

Existing work. Much recent work has focused on configuration optimization, i.e., approaches that aim to find a configuration that optimizes a performance objective [29, 72, 86, 100, 105]. Push-button optimization approaches that find the optimum configuration are not applicable here because they give no information about the underlying interactions between the faulty configuration options that caused the non-functional fault. This information is sought after by developers seeking to address these non-functional faults [82, 93].

arXiv:2010.06061v2 [cs.SE] 8 Mar 2021

Some previous work has used machine learning-based performance modeling approaches [36, 86, 87, 96]. These approaches are adept at inferring the correlations between certain configuration