Guided Recovery for Web Service Applications

Jocelyn Simmonds, Shoham Ben-David, Marsha Chechik
Department of Computer Science, University of Toronto
Toronto, ON M5S 3G4, Canada
{jsimmond, shoham, chechik}@cs.toronto.edu

ABSTRACT

Web service applications are dynamic, highly distributed, and loosely coupled orchestrations of services which are notoriously difficult to debug. In this paper, we describe a user-guided recovery framework for web services. When behavioural correctness properties (safety and bounded liveness) of an application are violated at runtime, we automatically propose and rank recovery plans which users can then select for execution. For safety violations, such plans essentially involve “going back”: compensating the occurred actions until an alternative behavior of the application is possible. For bounded liveness violations, such plans include both “going back” and “re-planning”: guiding the application towards a desired behavior. We report on the implementation and our experience with the recovery system.

Keywords

Web services, LTS, behavioural properties, runtime monitoring, planning, SAT solving.

1. INTRODUCTION

Recent years have seen the increased reliance on being able to conduct business over the Internet. The Service-Oriented Architecture (SOA) framework is a popular guideline for building web-based applications. A SOA-based application is an orchestration of services offered by (possibly third-party) components written in a traditional compiled language such as Java, or in an XML-centric language such as BPEL¹. Web services are distributed systems, where partners are dynamically discovered and are going on- and off-line as the application runs. Their failures can be caused by bugs in the service orchestration, e.g., due to faulty logic and bad data manipulation, or by problems with hardware, network or system software, or by incorrect invocations of services.
¹ http://docs.oasis-open.org/wsbpel/2.0/OS/wsbpel-v2.0-OS.html

With runtime failures of web services inevitable, infrastructures for running them typically include the ability to define faults and compensatory actions for dealing with exceptional situations. Specifically, the compensation mechanism is the application-specific way of reversing completed activities. For example, the compensation for booking a car would be to cancel the booking.

Existing infrastructures for web services, e.g., the BPEL engine, include mechanisms for fault definition, for specification of compensation actions, and for dealing with termination. When an error is detected at runtime, they typically try to compensate all completed activities for which compensations are defined, with the default compensation being the reversal of the most recently completed action. This approach presents several major problems: (1) The application is often allowed to continue running until the fault is discovered, thus executing and then compensating for a lot of unnecessary and potentially expensive activities. (2) It is hard to determine, a priori, the state of the application after executing compensation mechanisms. (3) There might be multiple compensations available, and choosing among them may depend on global information (e.g., avoid canceling the flight since it has a dollar cost associated with it, and try to cancel the hotel instead), but the automatic application of compensations does not allow the user of such a system to choose between them.
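The default policy described above can be illustrated with a minimal Python sketch. It is not the BPEL engine's implementation and not part of our framework; the `CompletedActivity` type, the `compensate_all` function, and the `checkWeather` activity are hypothetical names introduced only for illustration. The sketch assumes that each completed activity optionally carries a compensation, and that compensations are applied starting from the most recently completed activity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CompletedActivity:
    """A completed activity and its (optional) compensation. Hypothetical type."""
    name: str
    # None means no compensation is defined for this activity.
    compensate: Optional[Callable[[], str]] = None


def compensate_all(completed: List[CompletedActivity]) -> List[str]:
    """Apply every defined compensation, most recently completed activity first."""
    performed = []
    for activity in reversed(completed):
        if activity.compensate is not None:
            performed.append(activity.compensate())
    return performed


# Activities in the order they completed; the names are illustrative only.
log = [
    CompletedActivity("bookFlight", lambda: "cancelFlight"),
    CompletedActivity("checkWeather"),  # hypothetical: nothing to undo
    CompletedActivity("bookCar", lambda: "cancelCar"),
]

print(compensate_all(log))  # ['cancelCar', 'cancelFlight']
```

In this sketch every defined compensation is applied, including the flight cancellation, and the caller has no opportunity to prefer one compensation over another, which is the situation described in problem (3).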
This paper describes a user-guided recovery framework for web services, instantiating it on BPEL programs. We concentrate on behavioural correctness, and specifically, on the correct interaction between service partners. The overview of the approach is given in Fig. 1a. Our approach consists of three phases: Preprocessing, Monitoring and Recovery. It admits the following user guidance: (I) Application developers define a set of behavioral correctness properties that need to be maintained at runtime, as well as compensation costs and idempotent service calls (see Sec. 3.2). (II) (Optional) Application users provide criteria for choosing between possible recovery plans, e.g., based on the plan length, compensation cost, etc. (III) Application users manually choose the desired recovery plan among those automatically computed and ranked by our system.

We consider behavioral correctness properties to be scenarios that the system should exhibit and scenarios that the system should not exhibit. For example, consider a simple web-based Trip Advisor System (TAS). In a typical scenario, a customer either chooses to arrive at her destination via a rental car (and thus books it), or via an air/ground transportation combination, combining the flight with either a rental car from the airport or a limo. The requirement of the