Cost-Sensitive Fault Remediation for Autonomic Computing

Michael L. Littman and Thu Nguyen and Haym Hirsh
Department of Computer Science
Rutgers University, Piscataway, NJ

Eitan M. Fenson and Richard Howard
PnP Networks, Inc., Los Altos, CA

Abstract

We introduce a formal model of cost-sensitive fault remediation, derive an exact algorithm for solving the special case of deterministic observations, and demonstrate it on two example problems. This effort is part of two self-healing software projects that attempt to use collected data for better decision making in emerging autonomic systems.

1 Introduction

Motivated by our ongoing project on self-healing in autonomic systems, specifically diagnosis and repair, we introduce a formalism for cost-sensitive fault remediation (CSFR). In CSFR, a decision maker is responsible for repairing a system when it breaks down. To narrow down the source of the fault, the decision maker can perform a test at some cost, and to repair the fault it can carry out a remedial action. A remedial action incurs a cost and either restores the system to proper functioning or fails. In either case, the system informs the decision maker of the outcome. The decision maker seeks a minimum cost policy for remediating the fault.

CSFR bears many similarities to cost-sensitive classification (CSC, see Turney 1995, Greiner et al. 1996). Like standard classification, CSC is concerned with categorizing unknown instances. However, CSC can be viewed as a kind of sequential decision problem, in which the decision maker interacts with a system by requesting attribute values. The decision maker’s performance is the total of the individual action costs plus a charge for any misclassifications. Diagnosis problems include hidden information (the identity of the fault state), making them a kind of partially observable Markov decision process (POMDP, Kaelbling et al. 1998).
POMDPs are notoriously difficult to solve, although there is reason to believe that CSCs result in relatively benign POMDP instances (Zubek and Dietterich 2002, Guo 2002).

The main difference between CSC and CSFR, and the main novelty of our work, is that a decision maker in a CSFR model can use feedback on the success or failure of a remediation action to attempt an alternate repair. In CSC, classification actions end episodes, whereas in CSFR, episodes continue until the fault is repaired. Although this is a relatively small change, we feel it adds significantly to the degree of autonomy supported by the model.

In contrast to general POMDPs, faults in a CSFR model remain constant until repaired. This simplifies the construction of policies compared to the general case and may enable efficient approximations. Therefore, the CSFR model lies at an intermediate point on the spectrum from the simple CSC model to the general POMDP model.

2 Formal Problem Definition

A cost-sensitive classification problem can be described formally by a set of classes C, a prior probability distribution Pr(c) over the elements c ∈ C, a function representing misclassification costs m(c, c′) for incorrectly classifying an instance of class c′ as a member of class c, a set of tests T, a cost function j(t, c) giving the cost of executing test t ∈ T in class c ∈ C, and a naive Bayes model b(t, c) providing the conditional probability of observing the outcomes 0 or 1 for test t ∈ T given that the true class is c. Many related formalisms are possible; we describe this one because it strikes a balance between power and simplicity. The decisions made by a cost-sensitive classifier map each partial assignment of the values 0 and 1 to the tests in T to either an unobserved test in T to execute next or an announcement of a class c ∈ C.

The CSFR model adds a few wrinkles to the CSC model. Instead of being forced to classify instances, a CSFR learner can select from among a set of remedial actions R.
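
To make the CSC ingredients defined above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that b(t, c) is stored as the probability that test t yields outcome 1 in class c; the class names ("disk", "net"), the test names ("ping", "fsck"), and all numeric values are hypothetical and chosen only for illustration. The sketch shows how the naive Bayes model turns a partial assignment of test outcomes into a posterior over classes, and how expected costs follow from that posterior.

```python
from dataclasses import dataclass


@dataclass
class CSCProblem:
    """Container for a CSC problem (C, Pr, m, T, j, b). Illustrative only."""
    prior: dict          # Pr(c) for each class c in C
    misclass_cost: dict  # m[(c, c_true)]: cost of announcing c when the true class is c_true
    test_cost: dict      # j[(t, c)]: cost of executing test t in class c
    p_one: dict          # b[(t, c)]: ASSUMED to be Pr(test t yields 1 | class c)


def posterior(problem, observed):
    """Posterior over classes given a partial assignment {test: 0 or 1}."""
    weights = {}
    for c, p in problem.prior.items():
        w = p
        for t, outcome in observed.items():
            p1 = problem.p_one[(t, c)]
            # Naive Bayes: test outcomes are independent given the class.
            w *= p1 if outcome == 1 else 1.0 - p1
        weights[c] = w
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("observations have zero probability under the model")
    return {c: w / total for c, w in weights.items()}


def expected_test_cost(problem, belief, t):
    """Expected cost of executing test t under a belief over classes."""
    return sum(belief[c] * problem.test_cost[(t, c)] for c in belief)


def expected_misclass_cost(problem, belief, c):
    """Expected charge for announcing class c under a belief over classes."""
    return sum(belief[ct] * problem.misclass_cost[(c, ct)] for ct in belief)


# Hypothetical two-class, two-test instance.
toy = CSCProblem(
    prior={"disk": 0.25, "net": 0.75},
    misclass_cost={("disk", "disk"): 0.0, ("disk", "net"): 8.0,
                   ("net", "disk"): 4.0, ("net", "net"): 0.0},
    test_cost={("ping", "disk"): 1.0, ("ping", "net"): 3.0,
               ("fsck", "disk"): 2.0, ("fsck", "net"): 2.0},
    p_one={("ping", "disk"): 1.0, ("ping", "net"): 0.0,
           ("fsck", "disk"): 0.5, ("fsck", "net"): 0.5},
)
```

With no observations, `posterior(toy, {})` simply returns the prior. In this toy instance the "ping" test is deterministic, so observing its outcome removes one class from consideration entirely, while the uninformative "fsck" test leaves the belief unchanged.
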
The misclassification costs m are replaced by costs m(r, c) for taking remedial action r ∈ R when the fault class is c ∈ C. In addition, the goal function G(r, c) returns 1 if executing remedial action r ∈ R in class c ∈ C results in repairing the fault and 0 otherwise. We assume that for every c ∈ C there is some r ∈ R such that G(r, c) = 1 (every fault state can be fixed).

In our preliminary work, we have assumed deterministic observations, b(t, c) ∈ {0, 1} for all t ∈ T and c ∈ C, which is a substantial oversimplification. Even with this restriction, however, stochastic observations can be simulated by increasing the number of fault states in the model.

3 Optimal Planning in CSFR

To provide a first optimal planning algorithm for the CSFR model, we took advantage of the fact that test outcomes were deterministic in our example problems. This means that there is a single set of test outcomes associated with each fault state. This fact simplifies computations. Specifically, at the beginning of the remediation process, the probability of a fault c is simply its prior. As observations are made, some faults are