IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. X, NO. Y, JANUARY 2013

Local Recovery for High Availability in Strongly Consistent Cloud Services

James W. Anderson, Hein Meling, Alexander Rasmussen, Amin Vahdat, and Keith Marzullo

Abstract—Emerging cloud-based network services must deliver both good performance and high availability. Achieving both of these goals requires content replication across multiple sites. Many cloud-based services either require or would benefit from the semantics and simplicity of strong consistency. However, replication techniques for strong consistency can severely limit the availability of replicated services when recovering large data objects over wide-area links. To address this problem, we present the design and implementation of ZORFU, a hierarchical system architecture for replication across data centers. The primary contribution of ZORFU is a local recovery technique that significantly increases availability of replicated strongly consistent services. Local recovery achieves this by reducing the recovery time by an order of magnitude, while imposing only a negligible latency overhead. Experimental results show that ZORFU can recover a 100 MB object in 4 ms.

Index Terms—Wide-area state machine replication; Hierarchical replication; Paxos; Local recovery; Dependability analysis.

1 INTRODUCTION

Traditional desktop applications, such as word processing, email, and photo management, are increasingly moving to server-based deployments. However, moving applications to the cloud can reduce availability because Internet path availability averages only two-nines [1]. If a user’s application state is isolated on a single server, the availability for that user is limited by the path availability between the user’s desktop and that server. Hence, to improve availability, application state must be replicated across multiple servers placed in geographically distributed data centers.
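To make the availability argument concrete, the following minimal sketch estimates how the chance of reaching at least one replica site grows with the number of sites. It rests on an assumption that is not made in the text: path failures to different sites are treated as independent, so the result is an idealized upper bound. The function names `reachable_availability` and `nines` are hypothetical and chosen for illustration only.

```python
import math


def reachable_availability(path_availability: float, sites: int) -> float:
    """Probability that at least one of `sites` replica sites is reachable.

    Assumption (not from the paper): each path fails independently with
    probability 1 - path_availability.
    """
    if not 0.0 <= path_availability <= 1.0:
        raise ValueError("path_availability must be in [0, 1]")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    # All paths must fail at once for the user to be cut off.
    return 1.0 - (1.0 - path_availability) ** sites


def nines(availability: float) -> float:
    """Express an availability as a 'number of nines' (0.99 is about 2)."""
    if availability >= 1.0:
        return math.inf
    return -math.log10(1.0 - availability)


if __name__ == "__main__":
    # 0.99 corresponds to the two-nines path availability cited in the text.
    for sites in (1, 2, 3):
        a = reachable_availability(0.99, sites)
        print(f"{sites} site(s): availability={a:.6f}, nines={nines(a):.1f}")
```

This only models reachability of some replica from the user. A strongly consistent service additionally needs enough replicas to be mutually reachable to order updates, which this sketch does not capture.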
Replicating state across data centers, however, makes it harder to maintain consistency across updates. Maintaining strong consistency [2] is essential for correct system behavior for many cloud-based services. Other services that tolerate weak consistency [3] may still benefit from the simplified semantics offered by strong consistency. Examples include collaborative applications, electronic commerce, and financial analysis. The need for strong consistency has also been recognized by prominent cloud providers [4], [5], [6].

Providing strong consistency for data hosted at multiple sites is difficult because different updates, possibly from different users, could be directed at servers in different data centers. Even if updates are directed to the same server, the order in which multiple updates are committed to application state must be the same across all copies of the state.

• J. W. Anderson, A. Rasmussen, A. Vahdat and K. Marzullo are with the Department of Computer Science and Engineering, University of California San Diego. E-mail: jwanderson@gmail.com, alexras@acm.org, vahdat@cs.ucsd.edu, marzullo@cs.ucsd.edu.
• H. Meling is with the Department of Electrical Engineering and Computer Science, University of Stavanger, Norway. E-mail: hein.meling@uis.no.

Strong consistency is typically achieved using a Replicated State Machine (RSM) model [2], based on a consensus algorithm such as Paxos [7], [8] to order state machine operations. In this context, a fundamental requirement for strong consistency in RSMs is that 2f + 1 replicas are needed to tolerate f failures. Moreover, we target real-world deployment with billions of objects [9] stored across several data centers, each with tens of thousands of machines. At this scale, there is constantly a need to recover from common machine failures, and to do so without human intervention.
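The 2f + 1 requirement can be illustrated with a small sketch of the replica arithmetic. This is not code from ZORFU; the helper names below are hypothetical, and the sketch assumes a plain majority-quorum RSM in which progress requires more than half of the replicas to be available.

```python
def replicas_needed(f: int) -> int:
    """Number of replicas required to tolerate f failures (2f + 1)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1


def majority_quorum(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Largest f such that 2f + 1 <= n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 2


def can_make_progress(n: int, alive: int) -> bool:
    """True if `alive` out of n replicas still form a majority."""
    return alive >= majority_quorum(n)


if __name__ == "__main__":
    for f in (1, 2, 3):
        n = replicas_needed(f)
        print(f"f={f}: replicas={n}, quorum={majority_quorum(n)}")
```

With n = 2f + 1 replicas the majority quorum is f + 1, so losing one more replica than f leaves the group unable to order further operations until a replacement has been brought up to date.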
In this paper, we address the problem of automated recovery of RSMs across geographically distributed data centers connected by a wide-area network. In this scenario, we identify a window of vulnerability while recovering from a failure, during which a subsequent failure can cause the RSM to block indefinitely. This situation demands manual recovery, which would significantly reduce the system’s availability. This can happen if more than f failures occur before completing recovery from previous failures. It can also happen if the RSM state is not synchronized with a replacement replica before a subsequent failure occurs. Thus, despite the availability of at least f + 1 replicas, application-level RSM state is not synchronized sufficiently quickly to allow the RSM to safely¹ process updates. The existence of this window of vulnerability directly affects system availability, and it becomes particularly problematic when objects are large or synchronization takes place over relatively slow or congested wide-area links. These are exactly the scenarios we target.

¹ An RSM needs f + 1 replicas to make progress, and remain safe.

The principal contribution of this work is ZORFU, a system architecture for hierarchical replication designed to increase RSM availability by reducing the window of vulnerability that occurs during failure recovery. To