J2EE Instrumentation for software aging root cause application component determination with AspectJ

Javier Alonso and Jordi Torres
Barcelona Supercomputing Center
Dept. of Computer Architecture
Technical University of Catalonia
Barcelona, Spain
Email: [alonso, torres]@ac.upc.edu

Josep Ll. Berral
Dept. of Software
Dept. of Computer Architecture
Technical University of Catalonia
Barcelona, Spain
Email: berral@ac.upc.edu

Ricard Gavaldà
Dept. of Software
Technical University of Catalonia
Barcelona, Spain
Email: gavalda@lsi.upc.edu

Abstract—Unplanned system outages have a negative impact on company revenues and image. Although industry and academia have invested considerable effort over the last decades to avoid them, they still happen and their impact is increasing. According to many studies, one of the most important causes of these outages is software aging. Software aging refers to the accumulation of errors, usually provoking resource contention, during long-running application executions, such as web applications, which normally causes applications or systems to hang or crash. Determining the root cause of a software aging failure, and not merely the resource or resources involved, is a huge task because system complexity grows day by day. In this paper we present a monitoring framework based on Aspect-Oriented Programming that monitors the resources used by every application component at runtime. Knowing the resources used by every component of the application, we can determine which components are related to the software aging. Furthermore, we present a case study in which we evaluate our approach by determining, in a web application scenario, which components are involved in the software aging, with promising results.

I. INTRODUCTION

Enterprise environments are rapidly changing as new needs appear. In particular, availability of information at any time and from anywhere is today a common requirement.
To meet these new challenges demanded by industry and society, new IT infrastructures have had to be created. Applications have to interact with each other and with the environment in order to achieve these new goals, resulting in complex IT infrastructures that need brilliant IT professionals with hard-to-obtain skills to manage them. However, complexity is reaching levels that even the best system administrators can hardly cope with.

A recent study [1] showed that the average cost per hour of downtime or service degradation for a typical enterprise is around US$125,000. Moreover, outages have a negative impact on the company image that could affect profits indirectly. Furthermore, it is known that computer system outages are currently more often due to software faults than to hardware faults [2], [3]. Several studies [4], [5], [6] showed that software aging is one of the sources of unavailability. Software aging refers to the accumulation of errors, usually provoking resource contention, during long-running application executions, such as web applications, which normally causes applications or systems to hang or crash [7]. Gradual performance degradation can also accompany software aging. Software aging is often related to other phenomena, such as memory bloating/leaks, unterminated threads, data corruption, unreleased file locks and overruns.

For this reason, applications have to deal with the software aging problem in the production stage, making software rejuvenation techniques necessary [8]. Software rejuvenation techniques are mainly based on three options: system restart, application restart (partial rejuvenation), and node/application failover in a cluster system, in order to return to a stable state.

There are two basic rejuvenation strategies: time-based and proactive. In time-based strategies, rejuvenation is applied periodically at a fixed time interval.
In fact, time-based strategies are widely used in real environments such as web servers [9], [10]. In proactive strategies, on the other hand, system metrics are continuously monitored and the rejuvenation action is triggered when a crash or system hang, with software aging as the evident probable cause, is predicted to be imminent. The proactive approach is preferable because, if we can predict the crash and apply rejuvenation actions only in those cases, we reduce the number of rejuvenation actions with respect to the time-based approach.

The effectiveness of these proactive strategies depends on the accuracy of the monitoring system used to collect the system metrics. However, monitoring systems mainly collect system metrics while treating the applications as black boxes, which makes it impossible to know the root cause of the software aging. Here we call the root cause of the failure the application component responsible for the aging, usually a piece of software. We understand an application component to be the smallest piece into which the application can be divided: for example, objects, servlets, EJBs or others, depending on the technology used to develop the application.

Traditionally, monitoring systems are based on identifying the resource or resources involved in the software aging; however, they cannot offer any clue to help determine the piece of software where the bug is located. For this reason, the current main rejuvenation strategy is