The Fast and the Fair: A Fault-Injection-Driven Comparison of Restart Oracles for Reliable Web Services

Philipp Reinecke
Humboldt-Universität zu Berlin, Institut für Informatik, Berlin, Germany
preineck@informatik.hu-berlin.de

Aad P. A. van Moorsel
University of Newcastle upon Tyne, School of Computing Science, Newcastle upon Tyne, United Kingdom
aad.vanmoorsel@ncl.ac.uk

Katinka Wolter
Humboldt-Universität zu Berlin, Institut für Informatik, Berlin, Germany
wolter@informatik.hu-berlin.de

Abstract

Web Services are typically deployed in Internet or Intranet environments, making message transfers susceptible to a wide variety of network, protocol and system failures. To mitigate these problems, reliable messaging solutions for web services have been proposed that retry messages suspected to be lost. It is of interest to evaluate the performance of such reliable messaging solutions, and in this paper we therefore utilise a newly developed fault-injection environment for the analysis of time-out strategies for the Web Services Reliable Messaging standard. We compare oracles that determine retransmission times with respect to the tradeoff between two metrics: the effective transfer time and the overhead in terms of additional message transmissions. Our fault-injection environment allows faults to be invoked in the IP layer, representing a variety of failure situations in the underlying system. The study presented in this paper includes two adaptive oracles, which set the length of the retransmission interval based on round trip time measurements, and two static oracles. The study considers both HTTP and Mail as SOAP transports. We conclude that adaptive oracles may significantly outperform static oracles when the underlying system exhibits more complex behaviour.

1.
Introduction

With the continuing acceptance of Web Services technologies as a means of integrating applications, the dependability of the Web Services stack becomes increasingly important. Several attempts at defining an appropriate reliability standard have converged in Web Services Reliable Messaging (WSRM), which provides a framework to deliver messages ‘reliably between distributed applications in the presence of software component, system, or network failures’ [2]. Of the four delivery assurances specified in [2], both ‘at least once’ and ‘exactly once’ necessitate the retransmission of lost messages. While the standard describes positive and negative acknowledgements to determine the message transmission status, it does not provide details on the preferred waiting time (for a positive acknowledgement) until re-sending a message. Although exponential backoff is mentioned as one way to adjust the retransmission interval, the issue is effectively left open.

In this paper we experimentally investigate the influence of time-out strategies on the performance of, and overhead introduced by, WSRM. In particular, we analyse representative algorithms for four classes of restart¹ oracles (as explained in Section 3). We will see that the more complex the behaviour of the underlying network and system, the more it pays off to utilise strategies that adapt the time-out value based on observed data.

¹ Throughout the paper we use restart, retry, resend and retransmit interchangeably.

There are two important aspects to the evaluation of WSRM time-out strategies we will address throughout the paper: the optimal strategy as decidable at the level of WSRM and the interaction with reliability mechanisms at lower layers, in particular TCP. Especially the latter is extremely difficult to track, and we believe that experimen-