Performance and Availability Aware Regeneration For Cloud Based Multitier Applications

Gueyoung Jung†, Kaustubh R. Joshi‡, Matti A. Hiltunen‡, Richard D. Schlichting‡, Calton Pu†

† College of Computing, Georgia Institute of Technology, Atlanta, GA, USA
{gueyoung.jung,calton}@cc.gatech.edu

‡ AT&T Labs Research, 180 Park Ave., Florham Park, NJ, USA
{kaustubh,hiltunen,rick}@research.att.com

Abstract

Virtual machine technology enables agile system deployments in which software components can be cheaply moved, replicated, and allocated hardware resources in a controlled fashion. This paper examines how these facilities can be used to provide enhanced solutions to the classic problem of ensuring high availability while maintaining performance. By regenerating software components to restore the redundancy of a system whenever failures occur, we achieve improved availability compared to a system with a fixed redundancy level. Moreover, by smartly controlling component placement and resource allocation using information about application control flow and performance predictions from queuing models, we ensure that the resulting performance degradation is minimized. We consider an environment in which a collection of multitier enterprise applications operates across multiple hosts, racks, clusters, and data centers to maximize failure independence. Simulation results show that our proposed approach provides better availability and significantly lower degradation of system response times compared to traditional approaches.

1. Introduction

High availability and low response time are crucial, although often conflicting, requirements for the multitier applications that implement critical business functionality for many enterprises. Ensuring high availability requires the applications to be deployed with sufficient redundancy, potentially spanning several data centers, while distributed deployment and replication impose a performance penalty.
Redundancy is traditionally ensured by using reliable hardware components with high mean time between failures (MTBF) and quick repair or replacement of failed components, i.e., low mean time to repair (MTTR). However, current trends in system and data center design are changing the role of repair. Multitier systems are increasingly running on large numbers of cheap, less reliable commodity components, thus leading to a decrease in MTBF of the system components. For example, Google reported an average of 1000 node failures/yr in their typical 1800 node cluster for a cluster MTBF of 8.76 hours [11]. Meanwhile, skilled manpower is quickly becoming the most expensive resource, thus encouraging data center operators to achieve economies of scale by batching repairs and replacement, and increasing MTTR in the process. In fact, portable “data-center in a box” designs (e.g., [15]) that contain tightly packed individual components that are completely non-serviceable, i.e., with an infinite MTTR, are emerging.

These trends imply that applications will increasingly operate in environments in which parts of the infrastructure are in a failed state. Replication of software components is a standard technique used to ensure high availability. The level of redundancy must be high enough to tolerate additional failures until repairs eventually take place. Maintaining such redundancy under the low MTBF and high MTTR conditions is expensive (e.g., cost of hardware and software licenses) and may have a significant performance overhead (e.g., state replication). Meanwhile, reducing effective time-to-repair by maintaining standby spare resources that can be quickly deployed automatically is inefficient because the spares represent unutilized resources.

We present a solution that ensures high availability while maintaining good performance by employing all system resources (e.g., no idle standby resources) and limited levels of replication.
Specifically, when a hardware resource fails, we regenerate the affected software components and deploy them on the remaining resources so that the required availability and performance goals of all the applications in the system are met as long as possible. The regeneration-based approach can provide high availability