AURA: Recovering from Transient Failures in Cloud Deployments

Ioannis Giannakopoulos∗, Ioannis Konstantinou∗, Dimitrios Tsoumakos§ and Nectarios Koziris∗
∗Computing Systems Laboratory, School of ECE, NTUA, Athens, Greece
{ggian, ikons, nkoziris}@cslab.ece.ntua.gr
§Department of Informatics, Ionian University, Corfu, Greece
dtsouma@ionio.gr

Abstract: In this work, we propose AURA, a cloud deployment tool used to deploy applications over providers that tend to present transient failures. The complexity of modern cloud environments makes the deployment phase of an application error-prone, which hinders automation and magnifies costs in terms of both time and money. To overcome this challenge, AURA formulates an application deployment as a Directed Acyclic Graph traversal and re-executes the parts of the graph that failed. AURA can execute any deployment script that updates filesystem-related resources in an idempotent manner through the adoption of a layered filesystem technique. In our demonstration, we allow users to describe, deploy and monitor applications through a comprehensive UI and showcase AURA's ability to overcome transient failures, even in the most unstable environments.

I. INTRODUCTION

The advent of the cloud computing era generated new perspectives regarding the deployment of applications. The virtualized nature of the compute and storage resources, allocated in an entirely dynamic, pay-as-you-go manner, enables cloud users to call services, exposed as a set of APIs, in order to allocate Virtual Machines and storage devices and deploy their applications on top of them.
The concept of programmatic management of resources allows for the automation of complex deployment tasks that entail resource initialization (e.g., formatting virtualized block devices), execution of the necessary installation and configuration scripts (e.g., installing software dependencies and editing configuration files), etc. At the same time, the ever-increasing complexity of application architectures requires a synchronization mechanism among different resources, so as to guarantee that the deployment tasks happen in a particular order (without hindering parallelism) and, upon their execution, leave the newly deployed application in a functional and consistent state.

For the execution of the aforementioned deployment tasks, several systems have been proposed. Be it a component of the cloud ecosystem itself, such as OpenStack Heat [1] for OpenStack, AWS CloudFormation [2] for Amazon, etc., or an external software module that communicates with the provider as a client, such as Vagrant [3], Juju [4], etc., these systems share a similar view of the deployment process: they express it as a Directed Acyclic Graph (DAG), the nodes of which represent the deployment states of the application modules, while the edges represent the deployment scripts. The DAG itself expresses the order of execution and the dependencies among different states. Based on this model, the aforementioned systems first communicate with the cloud provider in order to allocate the necessary resources and then traverse the deployment DAG and execute the deployment scripts. Although the correctness of the deployment is guaranteed if the scripts terminate successfully, these systems do not take into consideration the unstable and error-prone nature of the cloud [5].
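To make the DAG model above concrete, the following minimal Python sketch traverses a deployment graph whose nodes are states and whose edges are scripts. It is an illustration only and not the implementation of AURA or of any of the systems cited above: the names deploy, run_with_retries and max_attempts are hypothetical, scripts are modeled as plain callables that raise an exception on failure, and execution is sequential for brevity. Re-running a failed script, as in the abstract, is only safe under the assumption that the script is idempotent.

```python
from collections import defaultdict, deque
from typing import Callable, List, Tuple

# Hypothetical model: a deployment script is a callable that raises on failure.
Script = Callable[[], None]


def run_with_retries(script: Script, max_attempts: int) -> None:
    """Run a script, re-executing it on failure up to max_attempts times.

    Assumes the script is idempotent, so re-running it is harmless.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            script()
            return
        except Exception:
            if attempt == max_attempts:
                raise  # the failure persisted; the deployment fails


def deploy(edges: List[Tuple[str, str, Script]], max_attempts: int = 1) -> List[str]:
    """Traverse a deployment DAG given as (source state, target state, script).

    A state is reached once every script on its incoming edges has completed.
    Returns the states in the order in which they were reached.
    """
    outgoing = defaultdict(list)
    pending = defaultdict(int)  # incoming scripts that have not completed yet
    states = set()
    for src, dst, script in edges:
        outgoing[src].append((dst, script))
        pending[dst] += 1
        states.update((src, dst))

    # Start from the states that have no incoming scripts.
    ready = deque(sorted(s for s in states if pending[s] == 0))
    reached = []
    while ready:
        state = ready.popleft()
        reached.append(state)
        for dst, script in outgoing[state]:
            run_with_retries(script, max_attempts)
            pending[dst] -= 1
            if pending[dst] == 0:
                ready.append(dst)

    if len(reached) != len(states):
        raise ValueError("deployment graph contains a cycle")
    return reached
```

With max_attempts=1 the sketch behaves like the baseline model, in which a single failing script aborts the whole deployment; a larger value re-executes only the script that failed instead of restarting the traversal.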
The complexity introduced by virtualization and cloud software often leads to transient failures that appear for a short period of time and then vanish: network glitches, temporary unavailability of network services such as DNS, read-only file systems attributed to random errors in the storage backends, etc., are some examples. Although most of these failures are harmless, since they disappear almost instantly, they are capable of causing a script execution to fail and, hence, of failing an entire application deployment. For example, assume that a deployment script is installing specific packages through a package management system (such as apt) when a network glitch occurs: the script will fail since the packages cannot be retrieved and, as a consequence, the entire deployment will also fail. Such failures can prove expensive, both in terms of budget and time. Taking into consideration that a deployment failure may also leave stale resources (e.g., VMs that were successfully provisioned) that require manual handling, it is apparent that the management of transient failures can prove extremely beneficial for the deployment process.

In order to tackle the aforementioned limitation of the existing deployment solutions, we present AURA¹, a system used to perform cloud deployments that attempts to overcome transient failures by re-executing the failed scripts, on the premise that by the time a script is re-executed the failure will have vanished. AURA is the implementation of the methodology described in our previous work [6] and it is

¹ According to Greek mythology, Aura was the goddess of breeze, commonly found in cloudy environments.

2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. 978-1-5090-6611-7/17 $31.00 © 2017 IEEE. DOI 10.1109/CCGRID.2017.133