I. J. Computer Network and Information Security, 2012, 5, 29-38
Published Online June 2012 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijcnis.2012.05.04
Copyright © 2012 MECS

Evaluating Overheads of Integrated Multilevel Checkpointing Algorithms in Cloud Computing Environment

Dilbag Singh, Jaswinder Singh, Amit Chhabra
Dept. of Computer Science & Engineering, Guru Nanak Dev University Amritsar, Punjab, 143001, India
Dggill2@gmail.com, chhabra.amit78@gmail.com, jaswindersingh@yahoo.com

Abstract: This paper presents a methodology for providing high availability to meet the demands of cloud clients. To attain this objective, failover strategies for cloud computing using integrated checkpointing algorithms are proposed. The proposed strategy integrates checkpointing with load balancing algorithms and uses multilevel checkpoints to decrease checkpointing overheads. To implement the proposed failover strategies, a cloud simulation environment is developed that can provide high availability to clients in case of failure or recovery of service nodes. The primary objective of this research work is to improve checkpoint efficiency and prevent checkpointing from becoming the bottleneck of cloud data centers. To find an efficient checkpoint interval, checkpointing overheads are also considered. By varying the rerun time of checkpoints, comparison tables are produced that can be used to find the optimal checkpoint interval. The proposed failover strategy works at the application layer and provides high availability for the Platform as a Service (PaaS) feature of cloud computing.

Index Terms: Failover, Load balancing, Node-recovery, Multilevel checkpointing, Restartation

I. INTRODUCTION

Cloud computing [1], [2], [3] is currently emerging as a powerful way to transform how the IT industry builds and deploys custom applications.
In a cloud environment, jobs keep arriving at the data centers for execution. Nodes are allocated to jobs according to their requirements, and successfully executed jobs leave the nodes. In this scenario, some nodes may become inactive because of a failure while they are executing threads. An efficient failover strategy is therefore needed, because a failure may otherwise force the entire job to restart even though some of its threads have already completed successfully on other nodes. When a node fails, meaning that it is no longer accessible to service any client demand, the cloud must migrate its jobs to another node.

A checkpoint is a local state of a job saved on stable storage. By executing checkpointing periodically, one can save the status of a process at consistent intervals [17], [18]. If a failure occurs, computation can resume from an earlier checkpoint, thereby avoiding restarting execution from the beginning. The process of restarting computation by rolling back to a consistent state is called rollback recovery. In a cloud computing environment, the nodes in the data centers do not share memory [19], so the load of a failed node must be transferred to other nodes whenever a failure occurs.

In this paper, the integration of checkpoints with load balancing algorithms for data centers (the cloud computing infrastructure) is considered, taking into account constraints such as infrastructure sharing, availability, failover, and the emphasis on customer service. These issues are addressed by proposing a smart failover strategy that provides high availability to client requests. A new cloud simulation environment is proposed in this paper, which keeps all nodes busy to achieve load balancing and also executes checkpoints to achieve failover successfully.
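The checkpoint and rollback-recovery idea described above can be illustrated with a minimal sketch. It is not the simulation environment used in this paper: the function names (`save_checkpoint`, `load_checkpoint`, `run_job`), the JSON file standing in for stable storage, the toy job (summing step indices), and the `fail_at` parameter that simulates a node failure are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the job state to stable storage (hypothetical JSON file).

    The state is written to a temporary file first and then renamed, so a
    crash during the write never leaves a half-written checkpoint behind.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run_job(path, n_steps, interval, fail_at=None):
    """Run a toy job that sums step indices, checkpointing every `interval` steps.

    `fail_at` (assumption) simulates a node failure just before that step.
    """
    state = load_checkpoint(path)  # rollback recovery: resume from last checkpoint
    for step in range(state["next_step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % interval == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    try:
        run_job(ckpt, n_steps=10, interval=3, fail_at=7)
    except RuntimeError:
        pass  # the "node" failed; work since the last checkpoint is lost
    # A second run (for example on another node) resumes from the checkpoint.
    print(run_job(ckpt, n_steps=10, interval=3))
```

In this sketch the failure at step 7 discards only the work done after the checkpoint taken at step 6, so the resumed run repeats one step instead of all seven.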
An integrated checkpointing algorithm executes in parallel with the essential computation. Therefore, the overheads introduced by checkpointing need to be reduced. Checkpointing should enable a cloud service provider (CSP) to provide high availability to client requests in case of failure. This demands frequent checkpointing, which in turn introduces significant overheads, so setting the checkpointing rerun time becomes critical. Multilevel checkpoints [9], [10], [11], [12], [13], [14], [15] are used in this research work to decrease the overheads of checkpoints.

A. Parameters and Metrics used in this paper

TABLE I. Parameters and Metrics used in this paper

Parameter name    Meaning
C                 Checkpoint overhead
L                 Checkpoint latency
R                 Time required for job migration
t                 Time spent on computation
t1                Number of times C runs
t2                Number of times R occurs
r                 Checkpointing ratio
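The trade-off between frequent checkpointing and its overhead can be sketched with the symbols of Table I. The additive cost model below is an assumption for illustration, not a formula taken from this paper: total time is taken as t + t1*C + t2*R plus the work lost per failure, which is assumed to average half a checkpoint interval. The function names and the sample values are hypothetical, and L and r are not used in this sketch.

```python
def total_execution_time(t, C, R, t1, t2, interval):
    """Estimated total time under an assumed additive model.

    t        : time spent on computation
    C        : overhead of one checkpoint
    R        : time required for one job migration
    t1       : number of times C runs
    t2       : number of times R occurs
    interval : checkpoint rerun time (time between checkpoints)

    Assumption: each failure loses, on average, half an interval of work.
    """
    lost_work = t2 * interval / 2
    return t + t1 * C + t2 * R + lost_work


def comparison_table(t, C, R, t2, intervals):
    """Build (interval, t1, total time) rows for several rerun times."""
    rows = []
    for interval in intervals:
        t1 = int(t // interval)  # checkpoints taken during t units of work
        rows.append((interval, t1, total_execution_time(t, C, R, t1, t2, interval)))
    return rows


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the shape of the trade-off.
    rows = comparison_table(t=1000, C=2, R=10, t2=2, intervals=[10, 50, 100, 200])
    for interval, t1, total in rows:
        print(f"interval={interval:>4}  t1={t1:>4}  total={total:.0f}")
    best = min(rows, key=lambda row: row[2])
    print("lowest estimated total at interval", best[0])
```

With these sample values, very short intervals are dominated by checkpoint overhead (t1*C) and very long intervals by lost work, so the lowest estimate falls at an intermediate rerun time.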