Injecting Memory Leaks to Accelerate Software Failures
Jing Zhao, Yuliang Jin
Computer Science and Tech. Dept.
Harbin Engineering University
Harbin, China
{jingzhao.duke, jyl198803}@gmail.com
Kishor S. Trivedi
Electrical and Computer Eng. Dept.
Duke University
Durham, USA
kst@ee.duke.edu
Rivalino Matias Jr.
School of Computer Science
Federal University of Uberlandia,
Uberlandia, Brazil.
rivalino@facom.ufu.br
Abstract—A number of studies have reported the phenomenon
of “Software aging”, caused by resource exhaustion and
characterized by progressive software performance
degradation. We develop experiments that simulate an on-line
bookstore application, following the standard configuration of
the TPC-W benchmark. We study the application failures caused
by memory leaks, using the accelerated life tests (ALT) method. In
our experiments, the memory consumption rate is selected as
the acceleration factor, and an inverse power law (IPL)-lognormal
model is used to estimate the time to failure at each acceleration
level. Subsequently, the time-to-failure distribution under the
normal condition is estimated. Our accelerated experiments based
on the IPL-lognormal model show that this approach greatly
reduces the cost of obtaining the time to failure at the
normal level, which can be used in scheduling software
rejuvenation. Finally, we use the Weibull time-to-failure
distribution at the normal level in a semi-Markov
process to optimize the software rejuvenation trigger interval.
Keywords—accelerated life tests; memory leaks; optimal
software rejuvenation; semi-Markov process; software aging
I. INTRODUCTION
Studies show that operational software failures are
transient in nature, caused by phenomena such as overloads
or timing and exception errors [1]. Grottke et al. classified
software faults into three types according to their
manifestation characteristics: Bohrbugs, Mandelbugs, and
aging-related bugs. They then analyzed the faults discovered
in the on-board software for 18 JPL/NASA space missions
based on this classification [2]. Aging-related bugs
cause an increasing failure rate, gradual software
performance degradation, and may eventually lead to a
system hang or crash. Software aging is mainly caused by
the successive accumulation of the effects of aging-related
fault activations. It leads to the exhaustion of system
resources, mainly due to memory leaks, unreleased locks,
non-terminated threads, shared-memory pool latching,
storage fragmentation, or comparable causes [3], [4]. Many
of the causes of software aging are very hard to identify due
to their randomness [5]. Hence, it is not uncommon to have
unknown aging faults causing known aging effects. This
undesired phenomenon exists not only in regular software
such as web and application servers, but also in critical
applications that require high dependability levels. Software
aging could cause great losses in safety-critical systems [6],
including the loss of human lives [7]. To counteract
software aging, researchers have proposed a proactive
approach called software rejuvenation (SR) [3].
Rejuvenation has been implemented in various computing
systems, such as billing data collection systems,
telecommunication systems, transaction processing systems,
and spacecraft systems [8], [9], [10]. It involves
occasionally terminating an application process, cleaning its
internal state and restarting it in order to release system
resources, so that the software performance is recovered.
One or more indicators of aging can capture the aging
behavior [1], [4], [19]. Such indicators are measurable
metrics of the target system likely to be influenced by
software aging.
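As an illustration of how an aging indicator might be used, the following minimal sketch fits a straight line to sampled memory usage and extrapolates to the point where a capacity limit would be reached. The linear trend, the function names, and the sample values are all assumptions made for illustration; the cited studies apply their own statistical methods to real measurements.

```python
from typing import Optional, Sequence, Tuple


def fit_linear_trend(times: Sequence[float],
                     values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of values = intercept + slope * times."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired samples")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("sample times must not all be equal")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return slope, intercept


def time_to_exhaustion(times: Sequence[float],
                       used_mb: Sequence[float],
                       capacity_mb: float) -> Optional[float]:
    """Extrapolate the fitted trend to the time at which usage hits capacity.

    Returns None when the indicator shows no upward trend.
    """
    slope, intercept = fit_linear_trend(times, used_mb)
    if slope <= 0:
        return None
    return (capacity_mb - intercept) / slope


# Hypothetical indicator samples: hours since start, memory used in MB.
hours = [0.0, 1.0, 2.0, 3.0, 4.0]
used = [100.0, 110.0, 120.0, 130.0, 140.0]
predicted = time_to_exhaustion(hours, used, capacity_mb=500.0)
print(predicted)  # 40.0 hours for this perfectly linear toy data
```

A real indicator is noisy and often nonlinear, so a trend test or a more robust slope estimator would usually replace the plain least-squares fit used here.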
The most popular web server on the Internet, the
Apache web server [11], is known to suffer from software
aging [12]. It has been demonstrated that the extent of
software aging depends on the workload imposed on the
system. For examples, see [1], [12], [13], [14] for the Apache
web server and [15], [16] for Axis. Most of the
previous experimental research on software aging and
rejuvenation employed the Apache web server as a test bed and
then used statistical methods to predict the time to resource
exhaustion [3], [12], [14], [16]. Analytic models used for
capturing software rejuvenation are based on the
assumption that the distribution of time to failure due to
software aging is known, and the aim is to determine the
optimal times to trigger rejuvenation in order to maximize
system availability or related measures [4], [15], [17].
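To make the idea of an optimal rejuvenation time concrete, the sketch below searches for the rejuvenation interval that maximizes steady-state availability when the time to failure is Weibull distributed. It uses a simplified age-based renewal argument, not the semi-Markov formulation of the analytic models cited above, and the shape, scale, repair time, and rejuvenation time are hypothetical values chosen only for illustration.

```python
import math


def weibull_reliability(t: float, shape: float, scale: float) -> float:
    """R(t) = exp(-(t / scale) ** shape): probability of no failure by time t."""
    return math.exp(-((t / scale) ** shape))


def mean_uptime(interval: float, shape: float, scale: float,
                steps: int = 2000) -> float:
    """Expected uptime per cycle: trapezoidal integral of R(t) on [0, interval]."""
    h = interval / steps
    total = 0.5 * (1.0 + weibull_reliability(interval, shape, scale))
    for i in range(1, steps):
        total += weibull_reliability(i * h, shape, scale)
    return total * h


def availability(interval: float, shape: float, scale: float,
                 repair_time: float, rejuvenation_time: float) -> float:
    """Availability when rejuvenating after `interval` hours of uptime.

    A cycle ends either with a failure before `interval` (costing repair_time)
    or with a planned rejuvenation at `interval` (costing rejuvenation_time).
    """
    survive = weibull_reliability(interval, shape, scale)
    up = mean_uptime(interval, shape, scale)
    down = (1.0 - survive) * repair_time + survive * rejuvenation_time
    return up / (up + down)


# Hypothetical parameters: shape > 1 models an increasing failure rate (aging).
SHAPE, SCALE = 2.0, 100.0          # Weibull shape and scale (hours)
REPAIR, REJUVENATION = 5.0, 0.5    # downtime in hours

candidates = [float(d) for d in range(5, 301, 5)]
best = max(candidates,
           key=lambda d: availability(d, SHAPE, SCALE, REPAIR, REJUVENATION))
print(best, availability(best, SHAPE, SCALE, REPAIR, REJUVENATION))
```

Rejuvenating too often wastes time on planned restarts, while rejuvenating too rarely lets costly failures dominate, so the availability curve peaks at an intermediate interval. A grid search is used here for transparency; a one-dimensional optimizer would serve equally well.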
Whatever approach is used for rejuvenation scheduling, whether
measurement-based, analytic, or both, the time to
failure should be estimated more efficiently. Due to the
difficulty in experimentally studying aging-related system
failures by observation of failure times, Matias et al.
develop a systematic approach to accelerate the aging
effects at the experimental level [18]. They introduce the
concept of aging factors and use different levels of
accelerated workload to increase the system degradation.
Based on the degradation data of selected system
characteristics, captured through measurements, they apply
the statistical technique of accelerated degradation tests
(ADT) to estimate the time to failure in normal condition
(without acceleration). Alternatively, in [19] the authors do
not use degradation data, but directly observe failures that
are also obtained under accelerated workloads. In this case,
they use another technique called accelerated life tests (ALT)
2011 22nd IEEE International Symposium on Software Reliability Engineering
1071-9458/11 $26.00 © 2011 IEEE
DOI 10.1109/ISSRE.2011.24