Using Accelerated Life Tests to Estimate Time to Software Aging Failure

Rivalino Matias Jr.
School of Computer Science
Federal University of Uberlândia
Uberlândia, Brazil
rivalino@facom.ufu.br

Kishor S. Trivedi
Electrical and Computer Eng.
Duke University
Durham, US
kst@ee.duke.edu

Paulo R. M. Maciel
Informatics Center
Federal University of Pernambuco
Recife, Brazil
prmm@cin.ufpe.br

Abstract— Software aging is the progressive degradation of software systems during runtime, and it is particularly noticeable in long-running applications. Aging-related failures are very difficult to observe, because the accumulation of aging effects usually requires long-term execution. Collecting a statistically significant sample of times to aging-related failure, so as to estimate the system's lifetime distribution, is therefore a very hard task. This problem prevents many experimental and analytical studies, mainly those focused on modeling aspects of software aging, from using representative parameter values. In this paper we propose and evaluate the use of quantitative accelerated life tests (QALT) to reduce the time needed to obtain the lifetime distribution of systems that fail due to software aging. Since QALT was developed for hardware failures, we adapt it to software aging experiments. We test the proposed approach experimentally by estimating the lifetime distribution of a real web server system. The accuracy of the estimated distribution is evaluated by comparing its reliability estimates with a sample of failure times observed from the real system under test. The mean time to failure calculated from the real sample falls inside the 90% confidence interval constructed from the estimated lifetime distribution, which indicates that the estimated model is accurate. For the real system investigated, the proposed approach reduces the time required to obtain the failure times by a factor of seven.
Keywords: software aging; accelerated life test; controlled experiments; software rejuvenation

I. INTRODUCTION

Software aging may be defined as the continuing degradation of a software application's performance during execution, caused by the accumulation of numerical errors, greedy resource allocation policies, unsafe resource release strategies, and file system degradation. This phenomenon is particularly observable in long-running applications such as web and application servers. The notion of software aging was formally introduced fifteen years ago in [1]. Since then, much theoretical and experimental research has been conducted in order to characterize and understand this important phenomenon.

Software aging can be understood as a continued and growing degradation of the software's internal state during its operational life. A general characteristic of this phenomenon is gradual performance degradation and/or an increase in failure rate [2]. Preventive maintenance can help postpone or prevent the occurrence of failures attributable to this cause. Such preventive maintenance has been called software rejuvenation [1], and it has been implemented in several special-purpose systems [3] and in at least one commercial system [4]. In order to determine the time epochs for triggering software rejuvenation, analytic models [5], monitoring of system resources followed by statistical analysis [6], [7], or a combination of the two [8], [9] have been advocated.

Two types of data have been collected and used in this context. Non-failure data on performance variables has been used in several efforts such as [6], [7], [8], while failure data is needed in models such as [1], [5], [9]. This paper is concerned with reducing the time needed to collect time to failure data for aging-related failures. Probing the system, gathering data, and performing statistical analysis are therefore critical tasks. Reducing this time is important for the development of prediction algorithms, and for the parameterization and validation of analytic models.
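To make concrete how a sample of failure times parameterizes a lifetime model, the following is a minimal sketch rather than the procedure used in this paper. It assumes, purely for illustration, a lognormal lifetime distribution, whose maximum-likelihood estimates have a closed form. The function names and the failure times in the usage section are hypothetical and invented for the example; they are not measurements from any real system.

```python
import math
import statistics


def fit_lognormal(failure_times):
    """Closed-form maximum-likelihood fit of a lognormal lifetime model.

    Assumption for illustration only: lifetimes are lognormally distributed.
    Returns (mu, sigma), the mean and standard deviation of log-lifetimes.
    """
    logs = [math.log(t) for t in failure_times]
    mu = statistics.fmean(logs)
    # The MLE of sigma is the population (divide-by-n) standard deviation.
    sigma = statistics.pstdev(logs, mu)
    return mu, sigma


def mttf(mu, sigma):
    """Mean time to failure of a lognormal(mu, sigma) lifetime."""
    return math.exp(mu + sigma ** 2 / 2)


def reliability(t, mu, sigma):
    """R(t) = P(T > t): probability that the system survives past time t."""
    z = (math.log(t) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical failure times in hours (synthetic, not measured data).
    sample = [310.0, 355.0, 402.0, 441.0, 498.0, 520.0, 610.0]
    mu, sigma = fit_lognormal(sample)
    print(f"mu={mu:.3f} sigma={sigma:.3f}")
    print(f"estimated MTTF: {mttf(mu, sigma):.1f} h")
    print(f"R(400 h): {reliability(400.0, mu, sigma):.3f}")
```

The quality of such estimates depends directly on the number of observed failures, which is why the time needed to collect each failure time matters.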
Collecting enough data for statistically significant predictions of the software aging phenomenon typically takes a long time and may be unaffordable in many circumstances. To the best of our knowledge, no published experimental study of software aging has used quantitative accelerated life tests (QALT) to reduce the time needed to collect failure time data for aging-related failures. In this paper, we present the results of using the QALT method to reduce the time needed to collect the data required to estimate the lifetime distribution of software systems suffering from software aging. First, we introduce the theoretical aspects necessary to understand how software aging failures can be accelerated. Next, we discuss how QALT can be used in software aging experiments. Because the QALT method was developed for hardware systems, we need to adapt some of its aspects to this new field of application, experimental research in software aging.

The rest of this paper is structured as follows. Section II presents a detailed view of software failure mechanisms, and specifically of software aging-related failure mechanisms. Section III describes the QALT method, emphasizing how the standard method needs to be adapted for application to software systems, particularly systems suffering from software aging. In Section IV, we use the proposed approach, based on the adapted QALT method, to extend previous work in order to

2010 21st International Symposium on Software Reliability Engineering 1071-9458/10 $26.00 © 2010 IEEE DOI 10.1109/ISSRE.2010.42 211