Using Accelerated Life Tests to Estimate Time to Software Aging Failure
Rivalino Matias Jr.
School of Computer Science
Federal University of Uberlândia
Uberlândia, Brazil
rivalino@facom.ufu.br
Kishor S. Trivedi
Electrical and Computer Eng.
Duke University
Durham, US
kst@ee.duke.edu
Paulo R. M. Maciel
Informatics Center
Federal University of Pernambuco
Recife, Brazil
prmm@cin.ufpe.br
Abstract— Software aging is a phenomenon defined as the
continuing degradation of software systems during runtime,
being particularly noticeable in long-running applications.
Aging-related failures are very difficult to observe, because the
accumulation of aging effects usually requires a long-term
execution. Thus, collecting a statistically significant sample of
times to aging-related failures so as to estimate the system’s
lifetime distribution is a very hard task. This is an important
problem that prevents many experimental and analytical
studies, mainly those focused on modeling aspects of software
aging, from using representative parameter values. In this paper
we propose and evaluate the use of quantitative accelerated life
tests (QALT) to reduce the time to obtain the lifetime
distribution of systems that fail due to software aging. Since
QALT was developed for hardware failures, in this paper, we
adapt it to software aging experiments. We test the proposed
approach experimentally, estimating the lifetime distribution
of a real web server system. The accuracy of the estimated
distribution is evaluated by comparing its reliability estimates
with a sample of failure times observed from the real system
under test. The mean time to failure calculated from the real
sample falls inside the 90% confidence interval constructed
from the estimated lifetime distribution, demonstrating the
high accuracy of the estimated model. The proposed approach
reduces the time required to obtain the failure times by a
factor of seven, for the real system investigated.
Keywords: software aging; accelerated life test; controlled experiments;
software rejuvenation
I. INTRODUCTION
Software aging may be defined as the continuing
performance degradation of software application execution
due to accumulation of numerical errors, greedy resource
allocation policies, non-safe resource releasing strategies,
and file system degradation. This phenomenon is particularly
observable in long-running applications such as web and
application servers.
Fifteen years ago, the notion of software aging was
formally introduced in [1]. Since then, much theoretical and
experimental research has been conducted in order to characterize
and understand this important phenomenon. Software aging
can be understood as being a continued and growing
degradation of the software internal state during its
operational life. A general characteristic of this phenomenon
is the gradual performance degradation and/or an increase in
failure rate [2]. Preventive maintenance can help postpone or
prevent the occurrence of failures attributable to this cause.
Such preventive maintenance has been called software
rejuvenation [1] and has been implemented in several special-purpose
systems [3] and at least one commercial system [4]. In order
to determine the time epochs for triggering software
rejuvenation, analytic models [5], monitoring system
resources followed by statistical analysis [6], [7], or a
combination [8], [9] have been advocated. Two types of data
have been collected and used in this context. Non-failure
data of performance variables has been used in several
efforts such as [6], [7], [8] while failure data is needed in
models such as [1], [5], [9]. This paper is concerned with
reducing the time needed to collect time to failure data in
connection with aging-related failures. Probing the system,
gathering data and statistical analysis are then critical tasks.
This is important for the development of prediction
algorithms, and for the parameterization and validation of
analytic models. Collecting data for statistically significant
predictions of the software aging phenomenon is typically a
long-lasting task and may be unaffordable in many
circumstances.
To the best of our knowledge, none of the published
experimental studies on software aging addresses reducing
the time needed to collect failure time data for
aging-related failures using quantitative accelerated
life tests (QALT). In this paper, we present results of using
the QALT method to reduce the time needed to collect the
data required to estimate the lifetime distribution of software
systems suffering from software aging. First, we introduce
the theoretical aspects necessary to understand how software
aging failures can be accelerated. Next, we discuss how
QALT can be used in software aging experiments. Because
the QALT method was developed for use in hardware systems,
we need to adapt some of its aspects to
this new field of application involving experimental research
in software aging.
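The general idea of accelerating failures and extrapolating back to normal operating conditions can be illustrated with a minimal sketch. It assumes an inverse power law life-stress relationship, L(S) = 1/(K S^n), which is a common choice in accelerated life testing but is an assumption here and is not taken from this section. The function names, the stress levels, and the lifetimes below are hypothetical and chosen only for illustration.

```python
import math


def fit_inverse_power_law(stress_levels, mean_lives):
    """Least-squares fit of ln(L) = a - n * ln(S); returns (a, n).

    Assumed model (not from the paper text): L(S) = exp(a) / S**n.
    """
    xs = [math.log(s) for s in stress_levels]
    ys = [math.log(life) for life in mean_lives]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, -slope


def predict_life(a, n, stress):
    """Characteristic life predicted by the fitted model at a given stress."""
    return math.exp(a - n * math.log(stress))


# Hypothetical data: mean observed life (hours) at three accelerated
# stress levels, expressed as multiples of the normal workload.
stress = [2.0, 4.0, 8.0]
lives = [50.0, 25.0, 12.5]

a, n = fit_inverse_power_law(stress, lives)

# Extrapolate to the use-level stress (1.0 = normal workload).
use_life = predict_life(a, n, 1.0)

# Acceleration factor of the highest stress level relative to use level.
acceleration_factor = use_life / predict_life(a, n, 8.0)

print(f"n = {n:.3f}, use-level life = {use_life:.1f} h, "
      f"AF = {acceleration_factor:.1f}")
```

A full QALT analysis would fit a lifetime distribution (for example, by maximum likelihood) to the individual failure times at each stress level rather than regressing on mean lives; the regression above only shows the extrapolation step.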
The remainder of this paper is organized as follows. Section II presents a
detailed view of software failure and specifically software
aging-related failure mechanisms. Section III describes the
QALT method, emphasizing how the standard method needs
to be adapted for application to software systems,
particularly systems suffering from software aging. In
Section IV, we use the proposed approach, based on the
adapted QALT method, to extend previous work in order to
2010 21st International Symposium on Software Reliability Engineering
1071-9458/10 $26.00 © 2010 IEEE
DOI 10.1109/ISSRE.2010.42