Fault-tolerance for Stateful Application Servers in the Presence of Advanced Transactions Patterns

Huaigu Wu, Bettina Kemme
School of Computer Science, McGill University, Montreal, Quebec, Canada, H3A 2A7
hwu19, kemme@cs.mcgill.ca

Abstract

Replication is widely used in application server products to tolerate faults. An important challenge is to correctly coordinate replication and transaction execution for stateful application servers. Many current solutions assume that a single client request generates exactly one transaction at the server. However, it is quite common that several client requests are encapsulated within one server transaction or that a single client request can initiate several server transactions. In this paper, we propose a replication tool that is able to handle these variations in request/transaction association. We have integrated our approach into the J2EE application server JBoss. Our evaluation using the ECPerf benchmark shows a low overhead of the approach.

1 Introduction

Application servers (AS) have become a prevalent building block in current information systems. Clients send requests to an AS which accesses database systems to manage persistent data. The AS runs the application programs and maintains volatile data, such as session information, i.e., the server is stateful. Requests are executed in the context of transactions which provide durability for the persistent data, isolation from concurrent transactions, and atomicity.

In the simplest execution model, each client request executes within its own individual transaction. In practice, however, execution can be more complex. For instance, the client can start a transaction, and then submit several requests in the context of this transaction before committing it. This is, e.g., often used when a web server (WS) is positioned between the real (internet) client and the AS. At the other extreme, one client request might create several independent transactions in the AS.
Application programmers often chop the execution of a request into a set of small transactions to avoid lock contention at the database.

Application servers are often replicated to achieve 24/7 availability. If one replica crashes, the work assigned to this replica can fail over to another replica. The challenge is to correctly handle requests and transactions that are active at the time of the crash. The AS replication solutions we are aware of only consider the simple case where one request is associated with exactly one transaction [15, 16, 14, 13, 28, 4, 3, 27]. In contrast, we propose a tool that is able to handle the different execution patterns described above.

The system should provide exactly-once execution and state consistency even in the case of crashes [15, 27]. Assuming the 1-request/1-transaction pattern, exactly-once means that for each submitted client request, the server executes the corresponding transaction exactly once. State consistency guarantees that the state at the AS replicas and the database is always consistent. We refine these correctness properties to be able to capture advanced execution patterns.

Our tool is based on an existing protocol [27] which assumes the simple 1-request/1-transaction pattern. It uses a classical primary/backup approach [18, 21, 15, 13, 2]. One server replica is the primary, which executes client requests. It propagates state changes to the backup replicas whenever a transaction commits. If the primary fails, a backup replica takes over, reconstructs the state of the old primary, and continues the client connections. Requests that were active at the time the primary crashed (and only those) are automatically restarted at the new primary. This paper extends the basic tool to support advanced execution patterns.

Our goal is to provide a practical solution with little overhead.
Hence, we have developed our replication tool within the context of a concrete AS architecture, namely J2EE [26], and integrated it into the open-source AS JBoss [17]. We believe, however, that the principal ideas can be applied to other kinds of application servers (e.g., CORBA, .NET), and hence keep the algorithmic description as general as possible. Our performance analysis shows that the approach compares favorably with other fault-tolerant solutions.

2 Background

AS architecture We assume the application logic to be programmed within components (Enterprise JavaBeans