This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
IEEE SYSTEMS JOURNAL
Fault Tolerance Management in Cloud Computing:
A System-Level Perspective
Ravi Jhawar, Graduate Student Member, IEEE, Vincenzo Piuri, Fellow, IEEE,
and Marco Santambrogio, Senior Member, IEEE
Abstract—The increasing popularity of Cloud computing as an
attractive alternative to classic information processing systems
has increased the importance of its correct and continuous
operation even in the presence of faulty components. In this
paper, we introduce an innovative, system-level, modular per-
spective on creating and managing fault tolerance in Clouds.
We propose a comprehensive high-level approach to hiding
the implementation details of the fault tolerance techniques from
application developers and users by means of a dedicated service
layer. In particular, the service layer allows the user to specify and
apply the desired level of fault tolerance, and does not require
knowledge about the fault tolerance techniques that are available
in the envisioned Cloud and their implementations.
Index Terms—Cloud computing, fault tolerance as a service,
fault tolerance properties, system level fault tolerance.
I. Introduction
THE INCREASING demand for flexibility in obtaining
and releasing computing resources in a cost-effective
manner has resulted in a wide adoption of the Cloud computing
paradigm. The availability of an extensible pool of
resources for the user provides an effective alternative for
deploying applications with high scalability and processing
requirements [23]. In general, a Cloud computing infrastructure
is built by interconnecting large-scale virtualized data centers,
and computing resources are delivered to the user over the
Internet in the form of an on-demand service by using virtual
machines (e.g., [1], [2]). While the benefits are immense, this
computing paradigm has significantly changed the risk profile
of users’ applications, specifically because the failures
(e.g., server overload, network congestion, hardware faults)
that manifest in the data centers are outside the scope of
the user’s organization [3], [4]. Nevertheless, these failures
have a significant impact on the applications deployed in
virtual machines and, as a result, there is an increasing need
to address users’ reliability and availability concerns.
The traditional way of achieving reliable and highly
available software is to make use of fault tolerance methods
at procurement and development time [26]. This implies
that users must understand fault tolerance techniques
and tailor their applications by considering environment-specific
parameters during the design phase. However, for
applications to be deployed in the Cloud computing
environment, it is difficult to design a holistic fault tolerance
solution that efficiently combines the failure behavior and
system architecture of the application. This difficulty arises
from: 1) high system complexity, and 2) abstraction layers
of Cloud computing that release limited information about
the underlying infrastructure to its users.
Manuscript received October 8, 2011; revised May 8, 2012; accepted June
3, 2012. This work was supported in part by the Italian Ministry of Research
within the PRIN 2008 Project PEPPER (2008SY2PH4).
R. Jhawar and V. Piuri are with the Università degli Studi di Milano, Crema
26013, Italy (e-mail: ravi.jhawar@unimi.it; vincenzo.piuri@unimi.it).
M. Santambrogio is with the Department of Electronics and
Information, Politecnico di Milano, Milan 20133, Italy (e-mail:
marco.santambrogio@polimi.it).
Digital Object Identifier 10.1109/JSYST.2012.2221934
In contrast with the traditional approach, we advocate a new
dimension where applications deployed in a Cloud computing
infrastructure can obtain required fault tolerance properties
from a third party. To support the new dimension, we extend
our work in [5] and propose an approach to realize general
fault tolerance mechanisms as independent modules such that
each module can transparently function on users’ applications.
We then enrich each module with a set of metadata that characterize
its fault tolerance properties, and use the metadata to select
mechanisms that satisfy users’ requirements. Furthermore,
we present a scheme that: 1) delivers a comprehensive fault
tolerance solution to users’ applications by combining selected
fault tolerance mechanisms, and 2) ascertains the properties of
a fault tolerance solution by means of runtime monitoring.
Based on the proposed approach, we design a framework
that easily integrates with the existing Cloud infrastructure
and facilitates a third party in offering fault tolerance as a
service.
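The idea of metadata-driven selection can be illustrated with a minimal sketch. It is not the framework described in this paper: the metadata fields (availability, recovery time, tolerated fault types), the module names, the numeric values, and the functions select_modules and still_holds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FTModule:
    """A fault tolerance mechanism packaged as an independent module.

    The three metadata fields below are assumed for this sketch only.
    """
    name: str
    availability: float          # advertised fraction of time the application stays up
    recovery_time_s: float       # advertised worst-case recovery time, in seconds
    tolerated_faults: frozenset  # fault types the mechanism can mask


@dataclass(frozen=True)
class Requirement:
    """What a user asks for, without naming any concrete technique."""
    min_availability: float
    max_recovery_time_s: float
    faults: frozenset


def select_modules(catalog, req):
    """Return the modules whose metadata satisfy every part of the requirement."""
    return [
        m for m in catalog
        if m.availability >= req.min_availability
        and m.recovery_time_s <= req.max_recovery_time_s
        and req.faults <= m.tolerated_faults  # all requested faults are covered
    ]


def still_holds(module, observed_availability):
    """Runtime check: does the monitored value still match the advertised one?"""
    return observed_availability >= module.availability


# Illustrative catalog; the figures are invented, not measured.
CATALOG = [
    FTModule("checkpoint_restart", 0.995, 120.0, frozenset({"crash"})),
    FTModule("active_replication", 0.9999, 1.0,
             frozenset({"crash", "host_failure"})),
]

if __name__ == "__main__":
    req = Requirement(0.999, 10.0, frozenset({"crash"}))
    for module in select_modules(CATALOG, req):
        print(module.name)
```

In this sketch the user states only the desired properties; the matching step decides which mechanism applies, and the monitoring check signals when an advertised property no longer holds.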
This paper is organized as follows. Section II describes
the motivating scenario and basic concepts of fault tolerance.
Section III presents our approach to resource management,
and Section IV outlines our two-stage service delivery scheme
that can transparently offer fault tolerance support to users’
applications. Section V presents the architectural details of
our framework. Section VI summarizes the related work, and
Section VII presents our conclusions.
II. Motivating Scenario and Basic Concepts
In this section, we describe the motivating scenario and
basic concepts of fault tolerance.
A. Motivating Scenario
We consider a highly complex and distributed infrastructure
that involves the following main stakeholders.
© 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future
media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.