Recovering Internet Service Sessions from Operating System Failures

Florin Sultan †, Aniruddha Bohra †, Pascal Gallard ∓, Iulian Neamtiu ‡, Stephen Smaldone †, Yufei Pan †, and Liviu Iftode †

† Department of Computer Science, Rutgers University, Piscataway, NJ 08854-8019
{sultan, bohra, smaldone, yufeipan, iftode}@cs.rutgers.edu

∓ IRISA / INRIA Rennes, Campus Universitaire de Beaulieu, 35042 RENNES Cedex - France
Pascal.Gallard@irisa.fr

‡ Department of Computer Science, University of Maryland, College Park, MD 20742
neamtiu@cs.umd.edu

Abstract

Operating system hangs, crashes, deadlocks, and panics are system failures that cause the loss of active client sessions in an Internet service. We describe a system that detects such failures and recovers service sessions in clusters of Internet servers. The core of our system is Backdoors, a novel system architecture that enables recovery of the light-weight state associated with client service sessions present in the memory of a server, even when its OS is no longer available. We have built a Backdoors prototype using commodity components, and we describe our experience with the system in recovering service sessions from multiple node failures in a complex multi-tier auction service.

1 Introduction

The growth of the Internet has led to critical Internet services, such as e-commerce, online auctioning, and banking, that run on complex, multi-tier architectures built with commodity machines and operating systems. The stateful nature and exactly-once semantics of these services make them sensitive to failures of the server machines. A crash or hang of the operating system of an Internet server leads to the loss of the active client sessions serviced by that node. In many cases, however, the state of these sessions may still be present in the memory of the failed machine.
In this article, we describe a lazy recovery approach that exploits the hardware and software redundancy of Internet service installations to reuse the state of active client sessions after an OS failure. Our system is based on Backdoors (BD), a novel system architecture that uses commodity programmable NICs, firmware, and OS extensions for remote access to light-weight application and OS state in the memory of a machine without relying on its OS or processors. Using BD, machines in an Internet server cluster can cooperatively observe each other's health, detect failures, and take over client sessions from failed nodes. We have implemented a BD prototype and conducted experiments with RUBiS, a cluster-based multi-tier Internet auction service modeled after eBay. The system could fail over all service sessions from failed nodes in both the front-end and middle tiers in under 25 ms, while preserving correct service semantics.

We next present the basis of our approach, then describe the BD architecture and the OS extensions for monitoring and recovery of service sessions from OS failures. We then present our Backdoors prototype and experimental results.

2 Motivation and Approach

Today's Internet services are supported by server machines organized in clustered multi-tier architectures, for reasons of availability and modularity. In a multi-tier architecture, multiple nodes perform processing for a client session, starting from a front-end that carries out stateful communication with the client over TCP/IP, going through one or more tiers that implement application logic, and terminating at a database server that manages persistent data. Machines in all tiers run commodity, general-purpose operating systems that have no support for tolerating failures caused by bugs in the OS code or by OS misconfiguration.
An OS failure is harmful because it renders a whole system unusable to application programs, which depend on core OS services (memory allocation, process management, I/O). To recover, an OS reboot is sufficient if the service is not critical and/or the node is stateless. Moreover, if the service is idempotent (that is, a request sent multiple times has the same outcome as a request sent once), then clients can recover from the failure by simply re-issuing their requests.

There are at least two problems with the reboot approach: (i) a reboot is destructive to currently executing transactions, forcing the clients to re-issue them, and (ii) a reboot is disruptive, incurring a monetary cost to every client and to the service provider. While most applications and their clients can tolerate the side effects of a reboot, such an approach may not be acceptable for the critical, transaction-oriented services that are increasingly prevalent on the Internet. First, depending on load balancing and admission policies, there is no guarantee that the clients would be re-admitted and able to resume their sessions. Second, the service guarantees may include uninterrupted delivery, at least to the extent that the internetwork permits it. For these reasons, we contend that current Internet service architectures lack support for salvaging ongoing stateful client sessions when the underlying OS fails.
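
The distinction drawn above between idempotent requests, which clients can safely re-issue after a reboot, and requests that need exactly-once semantics can be illustrated with a minimal sketch. The `AuctionItem` class, its method names, and the request identifiers below are hypothetical and chosen only for illustration; they are not taken from RUBiS or from the Backdoors prototype.

```python
class AuctionItem:
    """Toy in-memory auction item (hypothetical; not part of RUBiS)."""

    def __init__(self):
        self.bid_count = 0          # number of bids recorded
        self.max_bid = 0.0          # highest bid seen so far
        self.seen_requests = set()  # request IDs already applied

    def raise_max_bid(self, amount):
        # Idempotent: applying the same request twice leaves the same state.
        self.max_bid = max(self.max_bid, amount)

    def place_bid(self, amount):
        # Not idempotent: every re-issued request records one more bid.
        self.bid_count += 1
        self.max_bid = max(self.max_bid, amount)

    def place_bid_once(self, request_id, amount):
        # Exactly-once behavior: a duplicate request ID is ignored.
        if request_id in self.seen_requests:
            return False
        self.seen_requests.add(request_id)
        self.place_bid(amount)
        return True


def demo():
    item = AuctionItem()
    item.raise_max_bid(10.0)
    item.raise_max_bid(10.0)   # retry is harmless
    item.place_bid(12.0)
    item.place_bid(12.0)       # retry double-counts the bid
    first = item.place_bid_once("req-1", 15.0)
    retry = item.place_bid_once("req-1", 15.0)  # duplicate is dropped
    return item.max_bid, item.bid_count, first, retry
```

In this sketch the table of applied request IDs lives only in memory, so a reboot would discard it along with the rest of the session state; this is the kind of light-weight state whose loss makes blind re-issuing of requests unsafe.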