An Approach for Estimation of Software Aging in a Web Server

Lei Li, Kalyanaraman Vaidyanathan and Kishor S. Trivedi
Dept. of Electrical & Computer Engineering
Duke University, Durham, NC 27708, USA
{ll,kv,kst}@ee.duke.edu

Abstract

A number of recent studies have reported the phenomenon of “software aging”, characterized by progressive performance degradation or a sudden hang/crash of a software system due to exhaustion of operating system resources, fragmentation and accumulation of errors. To counteract this phenomenon, a proactive technique called “software rejuvenation” has been proposed. This essentially involves stopping the running software, cleaning its internal state and then restarting it. Because software rejuvenation is preventive in nature, it raises the question of when to schedule it. Periodic rejuvenation, while straightforward to implement, may not yield the best results. A better approach is to measure system resource usage and activity directly, and to use those measurements to detect aging and estimate resource exhaustion times. Estimating the resource exhaustion times makes it possible to initiate or better plan software rejuvenation so that system availability is maximized in the face of time-varying workload and system behavior.

In this paper, we propose a methodology based on time-series analysis to detect and estimate resource exhaustion times due to software aging in a web server while subjecting it to an artificial workload. We first collect and log data on several system resource usage and activity parameters on a web server. Time-series ARMA models are then constructed from the data to detect aging and estimate resource exhaustion times. The results are then compared with previous measurement-based models and found to be more efficient and computationally less intensive. These models can be used to develop proactive management techniques, such as software rejuvenation, that are triggered by actual measurements.

1. Introduction

It has now been well established that software faults are the major cause of computer system failures [12, 21]. Recently, a phenomenon called software aging has been studied and widely reported by researchers [9, 13]. Software aging is observed in a software application executing continuously for a long period of time, where the state of the software degrades and leads to performance degradation, hang/crash failures or both. The main causes of aging are exhaustion of operating system resources, data corruption and numerical error accumulation. Some common examples of software aging include memory leaks, unreleased file descriptors and numerical round-off errors.

In order to counteract this problem, Huang et al. [13] proposed the technique of software rejuvenation [26], which involves occasionally stopping the software application, removing the accrued error conditions and then restarting the application in a clean environment. This process removes the accumulated errors and frees up or defragments operating system resources, thus preventing, in a proactive manner, unplanned and potentially expensive future system outages. Rejuvenation has been implemented in various types of systems: telecommunication systems [1, 3], transaction processing systems [6], web servers [28, 31], cluster servers [5, 14], spacecraft systems [22] and safety-critical systems [16].

Measurement-based studies of software aging and rejuvenation on general computer systems have been carried out in previous work [5, 9, 23]. In this paper, we develop a new measurement-based approach using time-series analysis to detect software aging and estimate resource exhaustion times due to aging in a web server. A web server is a typical long-running software system which should ideally run forever. In practice, the performance of a web server degrades after a period of running due to unfixed bugs in the application or system software.
System administrators usually deal with this problem by occasionally rebooting the whole system or just restarting the web server program. While most administrators set the interval between restarts based on experience, or restart the web server only after it crashes, the objective of our research is to predict the exhaustion of the system resources, thus providing useful information for determining the appropriate time for restart. Our experiments are conducted on an Apache web server running on the Linux