Using Proactive Fault-Tolerance Approach to Enhance Cloud Service Reliability

Jialei Liu, Shangguang Wang, Senior Member, IEEE, Ao Zhou, Sathish A. P. Kumar, Senior Member, IEEE, Fangchun Yang, and Rajkumar Buyya, Fellow, IEEE

Abstract—The large-scale utilization of cloud computing services for hosting industrial/enterprise applications has led to the emergence of cloud service reliability as an important issue for both cloud service providers and users. To enhance cloud service reliability, two types of fault tolerance schemes, reactive and proactive, have been proposed. Existing schemes rarely consider the problem of coordination among multiple virtual machines (VMs) that jointly complete a parallel application. Without VM coordination, the parallel application execution results will be incorrect. To overcome this problem, we first propose an initial virtual cluster allocation algorithm according to the VM characteristics to reduce the total network resource consumption and total energy consumption in the data center. Then, we model CPU temperature to anticipate a deteriorating physical machine (PM). We migrate VMs from a detected deteriorating PM to some optimal PMs. Finally, the selection of the optimal target PMs is modeled as an optimization problem that is solved using an improved particle swarm optimization algorithm. We evaluate our approach against five related approaches in terms of the overall transmission overhead, overall network resource consumption, and total execution time while executing a set of parallel applications. Experimental results demonstrate the efficiency and effectiveness of our approach.

Index Terms—Cloud data center, cloud service reliability, fault tolerance (FT), particle swarm optimization (PSO), virtual cluster

1 INTRODUCTION

CLOUD computing is widely adopted in current professional and personal environments.
It employs several existing technologies and concepts, such as virtual servers and data centers, and gives them a new perspective [1]. Furthermore, it enables users and businesses to not only use applications without installing them on their machines but also access resources on any computer via the Internet [2]. With its pay-per-use business model for customers, cloud computing shifts the capital investment risk for under- or overprovisioning to cloud providers. Therefore, several leading technology companies, such as Google, Amazon, IBM, and Microsoft, operate large-scale cloud data centers around the world.

With the growing popularity of cloud computing, modern cloud data centers are employing tens of thousands of physical machines (PMs) networked via hundreds of routers/switches that communicate and coordinate to deliver highly reliable cloud computing services. Although the failure probability of a single device/link might be low [3], it is magnified across all the devices/links hosted in a cloud data center owing to the problem of coordination of PMs. Moreover, multiple fault sources (e.g., software, human errors, and hardware) are the norm rather than the exception [4]. Thus, downtime is common and seriously affects the service level of cloud computing [5]. Therefore, enhancing cloud service reliability is a critical issue that requires immediate attention.

Over the past few years, numerous fault tolerance (FT) approaches have been proposed to enhance cloud service reliability [6], [7]. It is well known that FT consists of fault detection, backup, and failure recovery, and nearly all FT approaches are based on the use of redundancy. Currently, two basic mechanisms, namely, replication and checkpointing, are widely adopted. In the replication mechanism, the same task is synchronously or asynchronously handled on several virtual machines (VMs) [8], [9], [10]. This mechanism ensures that at least one replica is able to complete the task on time.
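The two reliability arguments above, that a low per-device failure probability is magnified across a data center and that replication makes it likely that at least one replica completes the task, can be illustrated with a short calculation. The following is a minimal sketch and not part of the proposed approach: it assumes that devices and replicas fail independently with identical probability, which the text does not state, and the function names and numeric values are hypothetical.

```python
def prob_any_failure(p_fail: float, n_devices: int) -> float:
    """Probability that at least one of n_devices fails.

    Assumption (not from the paper): failures are independent and each
    device fails with the same probability p_fail.
    """
    return 1.0 - (1.0 - p_fail) ** n_devices


def prob_task_completes(p_replica_fail: float, n_replicas: int) -> float:
    """Probability that at least one of n_replicas completes the task.

    Assumption (not from the paper): replicas fail independently, each
    with probability p_replica_fail.
    """
    return 1.0 - p_replica_fail ** n_replicas


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    # A 0.1% per-device failure probability across 10,000 devices:
    print(f"any device fails:     {prob_any_failure(0.001, 10_000):.5f}")
    # A 10% per-replica failure probability with 1 and 3 replicas:
    print(f"1 replica completes:  {prob_task_completes(0.1, 1):.3f}")
    print(f"3 replicas complete:  {prob_task_completes(0.1, 3):.3f}")
```

Under these assumptions, each additional replica multiplies the probability of losing the task by the per-replica failure probability, while resource usage grows linearly with the number of replicas.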
Nevertheless, because of its high implementation cost, the replication mechanism is more suitable for real time or critical cloud services. The checkpointing mechanism is categorized into two main types: independent checkpoint mechanisms that only consider a whole application to perform on a VM, and coordinated checkpoint mechanisms that consider multiple VMs (i.e., a virtual cluster) to jointly execute parallel applications [11], [12], [13], [14], [15], [16].

J. Liu is with the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China and the Department of Computer Science and Information Engineering, Anyang Institute of Technology, Anyang, China. E-mail: liujialei@bupt.edu.cn.
S. Wang, A. Zhou, and F. Yang are with the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China. E-mail: {sgwang, aozhou, fcyang}@bupt.edu.cn.
S.A.P. Kumar is with the Department of Computer Science and Information Systems, Coastal Carolina University, Conway, SC 29528-6054. E-mail: skumar@coastal.edu.
R. Buyya is with Cloud Computing and Distributed Systems (CLOUDS) Laboratory, Department of Computing and Information Systems, The University of Melbourne, Melbourne, Vic. 3010, Australia. E-mail: rbuyya@unimelb.edu.au.

Manuscript received 20 Sept. 2015; revised 15 Mar. 2016; accepted 29 Apr. 2016. Date of publication 13 May 2016; date of current version 5 Dec. 2018.
Recommended for acceptance by D. Lie.
For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TCC.2016.2567392

IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 6, NO. 4, OCTOBER-DECEMBER 2018, p. 1191
2168-7161 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

The two types of mechanisms periodically save the