Adaptive Profiling for Root-cause Analysis of Performance Anomalies in Web-based Applications

João Paulo Magalhães
CIICESI, ESTGF-Porto Polytechnic Institute
Felgueiras, Portugal 4610-156
Email: jpm@estgf.ipp.pt

Luis Moura Silva
CISUC, University of Coimbra
Coimbra, Portugal 3030-290
Email: luis@dei.uc.pt

Abstract—The most important factor in the assessment of the availability of a system is the mean time to repair (MTTR). The lower the MTTR, the higher the availability. A significant portion of the MTTR is spent in the detection and localization of the cause of the failure. One possible method that may provide good results in the root-cause analysis of application failures is run-time profiling. The major drawback of run-time profiling is its performance impact. In this paper we describe two algorithms for selective and adaptive profiling of web-based applications. The algorithms make use of a dynamic profiling interval and are mainly triggered when some of the transactions start presenting symptoms of a performance anomaly. The algorithms were tested under different types of degradation scenarios and compared to static sampling strategies. We observed through experimentation that the pinpointing of performance anomalies, supported by the data collected using the adaptive profiling algorithms, remains as timely as with full profiling, while the response time overhead is reduced by almost 60%. When compared to a non-profiled version, the response time overhead is less than 1.5%. These results show the viability of using run-time profiling to support quick detection and pinpointing of performance anomalies and to enable timely recovery.

Keywords-application profiling; monitoring; root-cause analysis; performance anomalies; dependability

I. INTRODUCTION

Response time is a crucial aspect for companies that depend on web applications for most of their revenue. Recently, Bojan Simic presented in [1] the results of his latest research.
He found that website slowdowns can have twice the revenue impact on an organization as an outage. According to him, the average revenue loss for one hour of website downtime is $21,000, while the average revenue loss for an hour of website slowdown is estimated at $4,100; however, website slowdowns may occur 10 times more frequently than website outages. Likewise, according to a recent report provided by the Aberdeen Group [2], a delay of just 1 second in page load time can represent a loss of $2.5 million in sales per year for a site that typically earns $100,000 a day.

Developers are aware of these issues, and as part of the development cycle they adopt application profiling to identify where system resources are overburdened and to eliminate those bottlenecks. While essential to improving application performance, such off-line analysis does not capture run-time performance anomalies where, according to the Fail-Stutter fault model [3], some of the application components can start performing differently, leading to performance-faulty scenarios.

In [4] the authors estimate that 75% of the time to recover from application-level failures is spent just detecting and localizing them. Quick detection and localization are therefore a main contribution to reducing the MTTR (mean time to recovery) and thus improving service reliability.

In this context, run-time application profiling is extremely important to provide timely detection of abnormal execution patterns, pinpoint the faulty components and allow quick recovery. It is common sense that the more specific the profiling is, the more precise the analysis it allows. However, collecting detailed data at run time from across the entire application can introduce an overhead incompatible with the performance level required for the application.
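The trade-off between profiling detail and overhead described above can be illustrated with a static sampling strategy, which measures only one in every N invocations of a component. The following is a minimal sketch in Python, not the authors' AOP-based implementation; the names `sampled_profile`, `every_n`, `records` and `handle_request` are hypothetical and chosen only for illustration. Setting `every_n=1` corresponds to full profiling.

```python
import time
from functools import wraps


def sampled_profile(every_n, records):
    """Time only one call out of every `every_n` calls (hypothetical helper)."""
    if every_n < 1:
        raise ValueError("every_n must be >= 1")

    def decorator(func):
        calls = 0  # per-function call counter; not thread-safe in this sketch

        @wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal calls
            calls += 1
            if calls % every_n != 0:
                # Unsampled call: run the component without any measurement.
                return func(*args, **kwargs)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Sampled call: record the component name and elapsed seconds.
                records.append((func.__name__, time.perf_counter() - start))

        return wrapper

    return decorator


records = []


@sampled_profile(every_n=5, records=records)
def handle_request(x):
    # Stand-in for an application component (assumption for illustration).
    return x * 2


results = [handle_request(i) for i in range(20)]
```

With a fixed `every_n`, the measurement cost is paid on only a fraction of the calls, but an anomaly that starts between samples is noticed later. A larger `every_n` lowers the overhead and delays detection further, which is the tension a fixed interval cannot resolve.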
In past work [5] we developed some techniques for root-cause failure analysis and failure prediction that make use of Aspect-Oriented Programming (AOP) to do run-time monitoring of the application components and system values. The results were very sound, but to avoid the AOP-based profiling overhead (around 60%) we adopted a static profiling sampling strategy. Such an approach might not optimize the time required for localization, so we need to work further on the profiling algorithms to improve the time required to pinpoint the faulty components as well as to minimize the profiling overhead.

In this paper we propose two adaptive and selective algorithms to profile web-based or component-based applications. Making such adaptive algorithms useful for application profiling involves several challenges. In this paper we focus on algorithms suitable to: reduce the performance impact; allow the root cause of performance anomalies to be pinpointed in a timely manner; minimize the number of end users suffering from the effects of the performance anomalies; guarantee that application profiling is not itself con-

2011 IEEE International Symposium on Network Computing and Applications 978-0-7695-4489-2/11 $26.00 © 2011 IEEE DOI 10.1109/NCA.2011.30 163