Injecting Memory Leaks to Accelerate Software Failures

Jing Zhao, Yuliang Jin
Computer Science and Tech. Dept.
Harbin Engineering University
Harbin, China
{jingzhao.duke, jyl198803}@gmail.com

Kishor S. Trivedi
Electrical and Computer Eng. Dept.
Duke University
Durham, USA
kst@ee.duke.edu

Rivalino Matias Jr.
School of Computer Science
Federal University of Uberlandia
Uberlandia, Brazil
rivalino@facom.ufu.br

Abstract—A number of studies have reported the phenomenon of “software aging”, caused by resource exhaustion and characterized by progressive software performance degradation. We develop experiments that simulate an on-line bookstore application, following the standard configuration of the TPC-W benchmark. We study the application failures caused by memory leaks, using the accelerated life tests method. In our experiments, the memory consumption rate is selected as the acceleration factor, and an IPL-lognormal model is used to estimate the time to failure at each acceleration level. Subsequently, the estimate of the time to failure distribution at the normal condition is obtained. Our accelerated experimental results based on the IPL-lognormal model show that this approach greatly reduces the cost of obtaining the time to failure at the normal level, which can be used in scheduling software rejuvenation. Finally, we select the Weibull time to failure distribution at the normal level, to be used in a semi-Markov process, to optimize the software rejuvenation trigger interval.

Keywords—accelerated life tests; memory leaks; optimal software rejuvenation; semi-Markov process; software aging

I. INTRODUCTION

Studies show that operational software failures are transient in nature, caused by phenomena such as overloads or timing and exception errors [1]. Grottke et al.
classified software faults into three types according to their potential manifestation characteristics: Bohrbugs, Mandelbugs, and aging-related bugs. They then analyzed the faults discovered in the on-board software for 18 JPL/NASA space missions based on this classification [2]. Aging-related bugs cause an increasing failure rate and gradual software performance degradation, and may eventually lead to a system hang or crash. Software aging is mainly caused by the successive accumulation of the effects of aging-related fault activations. It leads to the exhaustion of system resources, mainly due to memory leaks, unreleased locks, non-terminated threads, shared-memory pool latching, storage fragmentation, or comparable causes [3], [4]. Many of the causes of software aging are very hard to identify due to their randomness [5]. Hence, it is not uncommon to have unknown aging faults causing known aging effects.

This undesired phenomenon exists not only in regular software such as web and application servers, but also in critical applications that require high dependability levels. Software aging could cause great losses in safety-critical systems [6], including the loss of human lives [7].

To counteract software aging, researchers have proposed a proactive approach called software rejuvenation (SR) [3]. Rejuvenation has been implemented in various computing systems, such as billing data collection systems, telecommunication systems, transaction processing systems, and spacecraft systems [8], [9], [10]. It involves occasionally terminating an application process, cleaning its internal state, and restarting it in order to release system resources, so that the software performance is recovered.

One or more indicators of aging can capture the aging behavior [1], [4], [19]. Such indicators are measurable metrics of the target system likely to be influenced by software aging.
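As a rough illustration of how an aging indicator might be used, the following Python sketch fits a least-squares line to sampled memory usage and extrapolates the time remaining until an assumed memory limit is reached. This is not the method used in this paper, which relies on accelerated life tests; the function names, the sample values, and the 400 MB limit are all hypothetical and serve only to make the idea of a measurable indicator concrete.

```python
def leak_trend(samples):
    """Return (slope, intercept) of a least-squares line through
    (time_seconds, used_megabytes) samples. Hypothetical helper."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in samples)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    slope = sxy / sxx
    return slope, mean_m - slope * mean_t


def time_to_exhaustion(samples, limit_mb):
    """Extrapolate the fitted trend to limit_mb and return the seconds
    remaining after the last sample, or None if usage is not growing."""
    slope, intercept = leak_trend(samples)
    if slope <= 0:
        return None  # no upward trend, so no exhaustion is predicted
    t_limit = (limit_mb - intercept) / slope
    return max(0.0, t_limit - samples[-1][0])


# Assumed measurements: memory grows by 6 MB every 60 s from 100 MB.
samples = [(0, 100.0), (60, 106.0), (120, 112.0), (180, 118.0)]
remaining = time_to_exhaustion(samples, limit_mb=400.0)
print(remaining)
```

With these assumed samples the fitted leak rate is 0.1 MB/s, so the sketch predicts about 2820 seconds until the limit is reached. A linear trend is only one possible choice and would be a poor fit for bursty or workload-dependent leaks.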
The most popular web server on the Internet, the Apache web server [11], is known to suffer from software aging [12]. It has been demonstrated that the extent of software aging depends on the workload imposed on the system. For examples, see [1], [12], [13], [14] for the Apache web server and [15], [16] for Axis. Most of the previous experimental research on software aging and rejuvenation employed the Apache web server as a test bed, and then used statistical methods to predict the time to resource exhaustion [3], [12], [14], [16]. Analytic models used for capturing software rejuvenation are based on the assumption that the distribution of time to failure due to software aging is known, and the aim is to determine the optimal times to trigger rejuvenation in order to maximize system availability or related measures [4], [15], [17]. Whatever approach is used for rejuvenation scheduling, whether measurement-based, analytic, or both, the estimated time to failure should be obtained more efficiently.

Due to the difficulty of experimentally studying aging-related system failures by observing failure times, Matias et al. develop a systematic approach to accelerate the aging effects at the experimental level [18]. They introduce the concept of aging factors and use different levels of accelerated workload to increase the system degradation. Based on the degradation data of selected system characteristics, captured through measurements, they apply the statistical technique of accelerated degradation tests (ADT) to estimate the time to failure in the normal condition (without acceleration). Alternatively, in [19] the authors do not use degradation data, but directly observe failures, also obtained under accelerated workloads. In this case, they use another technique called accelerated life tests (ALT)

2011 22nd IEEE International Symposium on Software Reliability Engineering. 1071-9458/11 $26.00 © 2011 IEEE. DOI 10.1109/ISSRE.2011.24