Adaptive Profiling for Root-cause Analysis of Performance Anomalies in Web-based
Applications
João Paulo Magalhães
CIICESI, ESTGF-Porto Polytechnic Institute
Felgueiras, Portugal 4610-156
Email: jpm@estgf.ipp.pt
Luis Moura Silva
CISUC, University of Coimbra
Coimbra, Portugal 3030-290
Email: luis@dei.uc.pt
Abstract—The most important factor in the assessment of
the availability of a system is the mean-time to repair (MTTR).
The lower the MTTR the higher the availability. A significant
portion of the MTTR is spent in the detection and localization
of the cause of the failure. One possible method that may
provide good results in the root-cause analysis of application
failures is run-time profiling. The major drawback of run-time
profiling is the performance impact.
In this paper we describe two algorithms for selective and
adaptive profiling of web-based applications. The algorithms
make use of a dynamic profiling interval and are mainly
triggered when some of the transactions start presenting
symptoms of a performance anomaly. The algorithms were tested
under different types of degradation scenarios and compared to
static sampling strategies. We observed through experimentation
that the pinpointing of performance anomalies, supported by
the data collected using the adaptive profiling algorithms, remains
as timely as with full profiling, while the response time overhead
is reduced by almost 60%. When compared to a non-profiled
version, the response time overhead is less than 1.5%. These
results show the viability of using run-time profiling to support
the quick detection and pinpointing of performance anomalies
and to enable timely recovery.
Keywords-application profiling; monitoring; root-cause anal-
ysis; performance anomalies; dependability
I. INTRODUCTION
Response time is a crucial aspect for companies that
depend on web applications for most of their revenue.
Recently, Bojan Simic presented in [1] the results of his
latest research. He found that website slowdowns can have
twice the revenue impact on an organization as an outage.
According to him, the average revenue loss for one hour of
website downtime is $21,000, while the average revenue loss
for an hour of website slowdown is estimated at $4,100;
however, website slowdowns may occur ten times more
frequently than website outages. Likewise, according to a recent
report provided by the Aberdeen Group [2], a delay of just one
second in page load time can represent a loss of $2.5 million
in sales per year for a site that typically earns $100,000 a
day.
Developers are aware of these issues and, as part of the
development cycle, they adopt application profiling to identify
where the burdens on system resources lie and to suppress
them. While essential to improve application performance,
such off-line analysis does not capture run-time performance
anomalies where, according to the Fail-Stutter fault model [3],
some of the application components can start performing
differently, leading to performance-faulty scenarios.
In [4] the authors estimate that 75% of the time to recover
from application-level failures is spent just detecting and
localizing them. Quick detection and localization is thus a
key means of reducing the MTTR (mean-time-to-recovery)
and so improving service reliability.
In this context, run-time application profiling is extremely
important to provide timely detection of abnormal execution
patterns, pinpoint the faulty components and allow quick
recovery. It is common sense that the more specific the profiling
is, the more precise the analysis it allows. However,
collecting detailed data at run-time from across the entire
application can introduce an overhead incompatible with the
performance level required for the application. In past work
[5] we developed some techniques for root-cause failure
analysis and failure prediction that make use of Aspect-
Oriented Programming (AOP) to do run-time monitoring of
the application components and system values. The results
were very sound, but to avoid the AOP-based profiling
overhead (around 60%) we adopted a static profiling sampling
strategy. Such an approach might not optimize the time
required for localization, so we need to work further on the
profiling algorithms to improve the time required to pinpoint
the faulty components as well as to minimize the profiling
overhead.
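The contrast between static and adaptive sampling can be sketched as follows. This is a minimal, hypothetical illustration in Java; the class and member names (AdaptiveSampler, shouldProfile, the interval-doubling back-off) are ours and do not reflect the actual algorithms or implementation described in this paper. The idea is simply that a sampler profiles 1 in N requests while the application looks healthy, drops to full profiling when a response-time symptom appears, and backs off gradually once latencies normalize:

```java
// Hypothetical sketch of symptom-driven adaptive sampling.
// A static strategy would keep `interval` fixed; here it adapts.
public class AdaptiveSampler {
    private final int baseInterval;       // profile 1 in N requests when healthy
    private final double slowThresholdMs; // latency above this is a symptom
    private int interval;                 // current sampling interval
    private long counter = 0;             // requests seen so far

    public AdaptiveSampler(int baseInterval, double slowThresholdMs) {
        this.baseInterval = baseInterval;
        this.slowThresholdMs = slowThresholdMs;
        this.interval = baseInterval;
    }

    /**
     * Decide whether to profile the current request, adapting the
     * interval based on the observed response time of the previous one.
     */
    public boolean shouldProfile(double lastResponseTimeMs) {
        if (lastResponseTimeMs > slowThresholdMs) {
            interval = 1; // symptom detected: switch to full profiling
        } else if (interval < baseInterval) {
            // no symptom: back off gradually toward the base interval
            interval = Math.min(baseInterval, interval * 2);
        }
        return (counter++ % interval) == 0;
    }

    public int currentInterval() {
        return interval;
    }
}
```

Under this scheme the per-request overhead stays near that of a non-profiled run in the common (healthy) case, while detailed data becomes available exactly when an anomaly needs to be localized, which is the trade-off the algorithms in this paper aim to optimize.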
In this paper we propose two adaptive and selective
algorithms to profile web-based or component-based ap-
plications. The usefulness of such adaptive algorithms for
application profiling encompasses several challenges. In this
paper we focus on algorithms suitable to:
• reduce the performance impact;
• allow timely pinpointing of the root-cause of performance
anomalies;
• minimize the number of end-users suffering from the
effects of performance anomalies;
• guarantee that application profiling is not itself con-
2011 IEEE International Symposium on Network Computing and Applications
978-0-7695-4489-2/11 $26.00 © 2011 IEEE
DOI 10.1109/NCA.2011.30
163