Software Fault Tolerance Modeling in a Server Virtualized System

Ohnmar Nhway
Faculty of Computer Systems and Technologies, University of Computer Studies, Myanmar
skynhway@gmail.com

Abstract: In the current age, Information Technology has become the backbone of every business. According to the literature, computer system outages are more often due to software faults than hardware faults. The slow degradation of software performance over time, caused by the exhaustion of operating system resources, is called software aging. Software rejuvenation is one of the most important techniques to counteract software aging. There are two types of software rejuvenation policy: the time based policy and the load and time based policy. In this paper, we present a load and time dependent software rejuvenation policy, which is used for software fault modeling in a server virtualized system. We consider the aging behavior of the system over time as well as the actual load of the system. The new model is intended for services that run 24x7 with zero downtime in most cases. A Markov Stochastic Petri Net (MSPN) model is constructed to represent the behavior of the system. System availability is evaluated both by numerical derivation and by simulation with the SHARPE tool. Finally, we show that the numerical derivation and the simulation give the same results for the system availability of this new model.

Keywords: Availability; Stochastic Petri Net; Software Aging; Software Rejuvenation Policy; Virtualization.

I. INTRODUCTION

Nowadays, Information Technology has become the backbone of every business. Business continuity is a key objective of an organization; it means that operations are up and running 24x7. Because modern society depends on the fault-free operation of complex computing systems, system fault tolerance has become an important issue.
It is commonly agreed that large software systems always contain faults and that precautions must be taken to avoid system failure. Failure of hardware components is often caused by external factors that can be neither predicted nor corrected. Therefore, mechanisms are needed that guarantee correct service in the presence of failures of system components, be they software or hardware elements [5].

A number of recent studies have reported the phenomenon of “software aging”, characterized by progressive performance degradation and/or an increased occurrence rate of hang/crash failures of a software system due to the exhaustion of operating system resources. To counteract this phenomenon, a proactive technique called “software rejuvenation” has been proposed.

The contribution of this paper is a load and time based rejuvenation policy that combines software rejuvenation methodology with virtualization technology to counteract the software aging problem in a server virtualized system. The behavior of the system is represented through a Markov Stochastic Petri Net (MSPN) model, which is subsequently solved for steady-state as well as transient conditions. However, we expect a trade-off between the downtime caused by crash failures and the downtime caused by rejuvenation, depending on how often rejuvenation is performed.

The rest of this paper is organized as follows. In Section 2, we discuss the related works, their evaluation techniques and results. In Section 3, we explain in detail our proposed load and time dependent software rejuvenation policy, its state transitions and the reachability analysis using multiple VMs (Virtual Machines) on a single physical server. The results of the analytic analysis follow in Section 4. Finally, we conclude the paper in Section 5.

II. RELATED WORKS

According to the literature, software has the ability to recover from a transient fault.
Most approaches, such as N-version programming and recovery blocks, are corrective in nature, i.e., recovery starts only after a failure has occurred. However, the overhead incurred by such recovery strategies remains high, and much research has been done to reduce it.

Y. Huang et al. [10] suggested a software rejuvenation technique that is preventive in nature. It involves periodic maintenance of the software so as to prevent crash failures. They define it as the periodic preemptive rollback of continuously running applications to prevent failures. While monitoring real applications, they observed that software typically "ages" as it runs: potential fault conditions accumulate slowly from the beginning of the software's activity.

The work most closely related to ours is that of T. Thein et al. [8],[9]. They use a time based rejuvenation policy in a highly available consolidated system. Different configurations of consolidated servers, with one and with two physical servers in a hot standby scheme, are considered. However, they do not consider VMM (Virtual Machine Monitor) failure and its rejuvenation. We propose a load and time based software rejuvenation policy for a server virtualized system that considers the software aging problem and the rejuvenation of VMs.
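
The trade-off noted in the introduction, between downtime caused by crash failures and downtime caused by rejuvenation, can be illustrated with a small numerical sketch. The code below is not the MSPN model of this paper. It assumes a simplified four-state continuous-time Markov chain (robust, aged, failed, rejuvenating), and every rate in it is a hypothetical value chosen only for illustration. The function name and parameter names are likewise our own.

```python
def steady_state_availability(aging_rate, failure_rate, repair_rate,
                              rejuvenation_rate, rejuvenation_done_rate):
    """Steady-state availability of an assumed four-state Markov chain.

    States and transitions (all rates per hour, all hypothetical):
      robust -> aged          at aging_rate
      aged   -> failed        at failure_rate
      aged   -> rejuvenating  at rejuvenation_rate (0 disables rejuvenation)
      failed -> robust        at repair_rate
      rejuvenating -> robust  at rejuvenation_done_rate
    The system counts as available in the robust and aged states.
    """
    # Unnormalized state weights from the balance equations,
    # taking the robust state as the reference weight 1.
    w_robust = 1.0
    w_aged = aging_rate / (failure_rate + rejuvenation_rate)
    w_failed = w_aged * failure_rate / repair_rate
    w_rejuvenating = w_aged * rejuvenation_rate / rejuvenation_done_rate

    total = w_robust + w_aged + w_failed + w_rejuvenating
    return (w_robust + w_aged) / total


# Hypothetical parameters: aging after about 10 days, failure about 90 days
# after aging, 4 hours to repair a crash, rejuvenation triggered about once
# a day while the software is in the aged state.
AGING = 1.0 / 240.0
FAILURE = 1.0 / 2160.0
REPAIR = 1.0 / 4.0
TRIGGER = 1.0 / 24.0

no_rejuvenation = steady_state_availability(AGING, FAILURE, REPAIR, 0.0, 6.0)
fast_rejuvenation = steady_state_availability(AGING, FAILURE, REPAIR, TRIGGER, 6.0)
slow_rejuvenation = steady_state_availability(AGING, FAILURE, REPAIR, TRIGGER, 1.0)

if __name__ == "__main__":
    print("no rejuvenation:          ", no_rejuvenation)
    print("rejuvenation, 10 min each:", fast_rejuvenation)
    print("rejuvenation, 60 min each:", slow_rejuvenation)
```

Under these assumed rates, rejuvenation that completes in ten minutes raises availability above the no-rejuvenation baseline, while rejuvenation that takes an hour lowers it. In this toy chain, then, rejuvenation only pays off when it is sufficiently cheaper than crash recovery.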