This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE SYSTEMS JOURNAL

Fault Tolerance Management in Cloud Computing: A System-Level Perspective

Ravi Jhawar, Graduate Student Member, IEEE, Vincenzo Piuri, Fellow, IEEE, and Marco Santambrogio, Senior Member, IEEE

Abstract—The increasing popularity of Cloud computing as an attractive alternative to classic information processing systems has heightened the importance of its correct and continuous operation even in the presence of faulty components. In this paper, we introduce an innovative, system-level, modular perspective on creating and managing fault tolerance in Clouds. We propose a comprehensive high-level approach to hiding the implementation details of the fault tolerance techniques from application developers and users by means of a dedicated service layer. In particular, the service layer allows the user to specify and apply the desired level of fault tolerance, and does not require knowledge about the fault tolerance techniques that are available in the envisioned Cloud and their implementations.

Index Terms—Cloud computing, fault tolerance as a service, fault tolerance properties, system level fault tolerance.

I. Introduction

THE INCREASING demand for flexibility in obtaining and releasing computing resources in a cost-effective manner has resulted in a wide adoption of the Cloud computing paradigm. The availability of an extensible pool of resources for the user provides an effective alternative for deploying applications with high scalability and processing requirements [23]. In general, a Cloud computing infrastructure is built by interconnecting large-scale virtualized data centers, and computing resources are delivered to the user over the Internet in the form of an on-demand service by using virtual machines (e.g., [1], [2]).
While the benefits are immense, this computing paradigm has significantly changed the nature of the risks to users' applications, specifically because the failures (e.g., server overload, network congestion, hardware faults) that manifest in the data centers are outside the scope of the user's organization [3], [4]. Nevertheless, these failures have a significant impact on the applications deployed in virtual machines and, as a result, there is an increasing need to address users' reliability and availability concerns. The traditional way of achieving reliable and highly available software is to make use of fault tolerance methods at procurement and development time [26]. This implies that users must understand fault tolerance techniques and tailor their applications by considering environment-specific parameters during the design phase. However, for the applications to be deployed in the Cloud computing environment, it is difficult to design a holistic fault tolerance solution that efficiently combines the failure behavior and system architecture of the application. This difficulty arises from: 1) high system complexity, and 2) abstraction layers of Cloud computing that release limited information about the underlying infrastructure to its users.

Manuscript received October 8, 2011; revised May 8, 2012; accepted June 3, 2012. This work was supported in part by the Italian Ministry of Research within the PRIN 2008 Project PEPPER (2008SY2PH4).
R. Jhawar and V. Piuri are with the Università degli Studi di Milano, Crema 26013, Italy (e-mail: ravi.jhawar@unimi.it; vincenzo.piuri@unimi.it).
M. Santambrogio is with the Department of Electronics and Information, Politecnico di Milano, Milan 20133, Italy (e-mail: marco.santambrogio@polimi.it).
Digital Object Identifier 10.1109/JSYST.2012.2221934
In contrast with the traditional approach, we advocate a new dimension in which applications deployed in a Cloud computing infrastructure can obtain the required fault tolerance properties from a third party. To support the new dimension, we extend our work in [5] and propose an approach to realize general fault tolerance mechanisms as independent modules such that each module can transparently function on users' applications. We then enrich each module with a set of metadata that characterize its fault tolerance properties, and use the metadata to select mechanisms that satisfy users' requirements. Furthermore, we present a scheme that: 1) delivers a comprehensive fault tolerance solution to users' applications by combining selected fault tolerance mechanisms, and 2) ascertains the properties of a fault tolerance solution by means of runtime monitoring. Based on the proposed approach, we design a framework that easily integrates with the existing Cloud infrastructure and enables a third party to offer fault tolerance as a service.

This paper is organized as follows. Section II describes the motivating scenario and basic concepts on fault tolerance. Section III presents our approach to resource management, and Section IV outlines our two-stage service delivery scheme that can transparently offer fault tolerance support to users' applications. Section V presents the architectural details of our framework. Section VI summarizes the related work, and Section VII presents our conclusions.

II. Motivating Scenario and Basic Concepts

In this section, we describe the motivating scenario and basic concepts on fault tolerance.

A. Motivating Scenario

We consider a highly complex and distributed infrastructure that involves the following main stakeholders.

1932-8184/$31.00 © 2012 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.