AURA: Recovering from Transient Failures in Cloud Deployments
Ioannis Giannakopoulos∗, Ioannis Konstantinou∗, Dimitrios Tsoumakos§ and Nectarios Koziris∗
∗Computing Systems Laboratory, School of ECE, NTUA, Athens, Greece
{ggian, ikons, nkoziris}@cslab.ece.ntua.gr
§Department of Informatics, Ionian University, Corfu, Greece
dtsouma@ionio.gr
Abstract—In this work, we propose AURA, a cloud deployment
tool used to deploy applications over providers that
tend to present transient failures. The complexity of modern
cloud environments makes the deployment phase of an
application error-prone, which hinders automation and
increases costs in terms of both time and money. To overcome
this challenge, AURA formulates an application deployment as a
Directed Acyclic Graph traversal and re-executes the parts of
the graph that failed. AURA can execute any deployment script
that updates filesystem-related resources in an idempotent
manner through the adoption of a layered filesystem technique.
In our demonstration, we allow users to describe, deploy
and monitor applications through a comprehensive UI and
showcase AURA’s ability to overcome transient failures, even
in the most unstable environments.
I. INTRODUCTION
The advent of the cloud computing era generated new
perspectives regarding the deployment of applications. The
virtualized nature of the compute and storage resources,
allocated in an entirely dynamic, pay-as-you-go manner,
enables the cloud users to call services, exported as a set
of APIs, to allocate Virtual Machines and storage devices
and deploy their applications on top of them. The concept
of programmatic management of resources allows for the
automation of complex deployment tasks that entail resource
initialization (e.g., formatting virtualized block devices), ex-
ecution of the necessary installation and configuration scripts
(e.g., installing software dependencies and editing configuration
files), etc. Simultaneously, the ever-increasing complexity
of application architectures requires the use
of a synchronization mechanism among different resources
so as to guarantee that the deployment tasks happen in a
particular order (without hindering parallelism) and, upon
their execution, leave the newly deployed application in a
functional and consistent state.
For the execution of the aforementioned deployment tasks,
several systems have been proposed. Be it a component of
the cloud ecosystem itself, such as OpenStack Heat [1] for
OpenStack, AWS CloudFormation [2] for Amazon, etc., or
an external software module that communicates with the
provider as a client, such as Vagrant [3], Juju [4], etc., these
systems share a similar view of the deployment process
as they express it as a Directed Acyclic Graph (DAG),
the nodes of which represent the deployment states of the
application modules and the edges represent the deployment
scripts. The DAG itself expresses the order of execution
and the dependencies among different states. Based on
this model, the aforementioned systems first communicate
with the cloud provider in order to allocate the necessary
resources and then traverse the deployment DAG, executing the
deployment scripts.
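The DAG model described above can be illustrated with a minimal sketch. It is not the implementation of any of the cited systems: the `traverse` helper, the edge dictionary and the state names are hypothetical, and the scripts are stand-in Python callables rather than real deployment scripts. States are nodes, scripts are attached to edges, and a script runs only after its source state has been reached.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def traverse(edges):
    """Run every edge script once its source state has been reached.

    edges: dict mapping (src_state, dst_state) -> zero-argument callable.
    Returns the list of states in the order they were reached.
    """
    # Build a predecessor map: each state depends on the sources of its
    # incoming edges.
    preds = {}
    for src, dst in edges:
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)

    reached = []
    # static_order() yields a state only after all of its predecessors.
    for state in TopologicalSorter(preds).static_order():
        for src in sorted(preds[state]):
            edges[(src, state)]()  # execute the script on edge src -> state
        reached.append(state)
    return reached


# Hypothetical two-module application: a database and a web server that
# must both be installed before the application is configured.
log = []
edges = {
    ("vm_allocated", "db_installed"): lambda: log.append("install_db"),
    ("vm_allocated", "web_installed"): lambda: log.append("install_web"),
    ("db_installed", "app_configured"): lambda: log.append("configure_db_side"),
    ("web_installed", "app_configured"): lambda: log.append("configure_web_side"),
}
order = traverse(edges)
```

This sketch runs the scripts sequentially for simplicity. The two installation scripts do not depend on each other, so a real deployment tool could run them in parallel.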
Although the correctness of the deployment is guaranteed
if the scripts terminate successfully, the aforementioned
systems do not take into consideration the unstable and
error-prone nature of the cloud [5]. The complexity introduced by
the virtualization and cloud software often leads to transient
failures that appear for a short period of time and then
vanish: network glitches, temporary unavailability of network
services such as DNS, and read-only file systems attributed to
random errors in the storage backends are some examples.
Although most of these failures are harmless in themselves,
since they disappear quickly, they can cause a script
execution to fail and, hence, fail an entire application
deployment. For example, assume that a deployment script
installs specific packages through a package management
system (such as apt) when a network glitch occurs. The
script will fail since the packages will not be retrieved
and, as a consequence, the entire deployment will also
fail. Such failures can prove expensive, both in terms of
budget and time. Taking into consideration that a deployment
failure may also leave stale resources (e.g., VMs that were
successfully provisioned) that require manual handling, it is
apparent that the management of transient failures can prove
extremely beneficial for the deployment process.
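To make this failure mode concrete, the following minimal sketch simulates a script that is hit by a transient failure and shows how re-running it can let it complete. Everything in it is an assumption made for illustration: `TransientError`, `make_flaky_script` and `run_with_retries` are hypothetical names, and the failure is simulated rather than produced by a real package manager or network.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for a failure that may vanish on its own."""


def make_flaky_script(failures_before_success):
    """Return a script that fails a fixed number of times, then succeeds.

    The returned callable reports how many times it has been called, which
    stands in for the outcome of a real deployment script.
    """
    state = {"calls": 0}

    def script():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("simulated network glitch")
        return state["calls"]

    return script


def run_with_retries(script, max_attempts=3, delay_seconds=0.0):
    """Re-execute a script until it succeeds or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return script()
        except TransientError:
            if attempt == max_attempts:
                raise  # the failure persisted; give up and report it
            time.sleep(delay_seconds)  # give the glitch time to vanish


# A script that fails twice succeeds on its third execution.
attempts_needed = run_with_retries(make_flaky_script(2), max_attempts=3)
```

Re-running a script is only safe when the script is idempotent. The abstract notes that AURA handles this for filesystem-related resources through a layered filesystem technique, which this sketch does not model.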
In order to tackle the aforementioned limitation of the
existing deployment solutions, we provide AURA¹, a system
used to perform cloud deployments, attempting to overcome
transient failures through re-executing the failed scripts on
the premise that when the script is re-executed, the failure
will have vanished. AURA is the implementation of the
methodology described in our previous work [6] and it is
¹According to Greek mythology, Aura was the goddess of breeze,
commonly found in cloudy environments.
2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
978-1-5090-6611-7/17 $31.00 © 2017 IEEE
DOI 10.1109/CCGRID.2017.133