Using Access Logs To Detect Application-Level Failures

Greg Friedman, Peter Bodík, Lukas Biewald, HT Levine§, George Candea
Stanford University, UC Berkeley, §Ebates.com

December 13, 2004

Abstract

Most Web applications suffer from application-level faults that decrease their availability. A majority of the time spent recovering from these faults is spent detecting and diagnosing them, and in some cases this can take weeks or even months. We propose failure detection and diagnosis by a combination of automatic anomaly detection in the HTTP access logs and a visual monitoring tool. Our approach is generic and can be applied to any web application without any special instrumentation and with minimal parameterization. Evaluation performed using HTTP logs from Ebates.com shows that we can detect and localize failures that occurred at this site.

1 Introduction

Web applications are becoming increasingly complex, while at the same time the demands on the reliability and availability of these systems are more and more strict. Many Web applications suffer from application-level faults that cause user-visible failures or downtime of the site. As reported in [KF04], as much as 75% of the time spent recovering from these failures is spent just detecting them. There are several existing and widely-deployed techniques for detecting failures ([IBM], [HP], [Alt]), but they typically only detect problems that manifest themselves in measurable system metrics (rather than in changes to user behavior). Other techniques for detecting more subtle application-level failures (such as [CKF+]) require instrumentation of the application and/or the middleware.

While the previously mentioned approaches monitor the web applications and the servers, the perfect detectors of failures are the users of the web site. They quickly notice that the site is down or that a page is unavailable.
But they usually don't let the operators know about the problem they detected; mostly they just leave the site. However, their behavior is likely to change in response to a failure: for example, if the link from the /shopping cart page to the /checkout page is broken, users simply can't reach the /checkout page.

We can't ask the users about possible problems at the site, nor can we directly observe their actions. However, to take advantage of users detecting failures for us, we can simply observe changes or anomalies in their behavior at the web site. Users' behavior is well captured in the HTTP access logs at the web servers; we can reconstruct access patterns to all pages and sequences of pages in sessions of individual users. By modeling typical user behavior and access patterns, we can notice anomalies that often correspond to problems on the web site. After detecting an anomaly, we also identify the most anomalous pages and significant changes in page transitions. This information helps the operator to quickly detect and diagnose a possible failure.

In addition to serious failures that cause significant anomalies in the HTTP logs, this approach has the potential to also detect much more subtle problems. It took weeks or even months to detect many of the problems that occurred at Ebates.com. An example of such a problem is a particular HTML link that worked on all Internet browsers except a specific version of Netscape Navigator. Such problems are almost impossible to detect by analyzing just the daily traffic; we need to consider much longer time periods.

Another reason for detecting failures using just the HTTP access logs is that these logs are readily available on all web sites. This approach can thus be applied to other web sites without any instrumentation of the web applications.
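To make the idea concrete, the detection step can be sketched in a few lines of Python. This is our own illustration, not the paper's actual algorithms: it parses Common Log Format access-log lines, counts hits per page, and compares the current hit distribution against a historical baseline using a chi-square statistic; the per-page contributions to the statistic then rank the most anomalous pages. The log-line regex and the example page names are assumptions for illustration.

```python
import re
from collections import Counter

# Regex for the Common Log Format; a real deployment would use a
# site-specific pattern (this one is an illustrative assumption).
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+) [^"]*" (\d{3}) \S+')

def page_counts(log_lines):
    """Count hits per page (URL path) in a batch of access-log lines."""
    counts = Counter()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

def chi_square_anomaly(baseline, current):
    """Compare the current page-hit distribution against a historical
    baseline; return the chi-square statistic and the pages ranked by
    their contribution to it (most anomalous first)."""
    total_base = sum(baseline.values())
    total_cur = sum(current.values())
    contributions = {}
    for page, base_hits in baseline.items():
        # Expected hits if current traffic followed the baseline mix.
        expected = base_hits / total_base * total_cur
        observed = current.get(page, 0)
        contributions[page] = (observed - expected) ** 2 / expected
    stat = sum(contributions.values())
    ranked = sorted(contributions, key=contributions.get, reverse=True)
    return stat, ranked
```

For instance, if the baseline shows steady traffic to a checkout page but the current interval shows none, that page dominates the chi-square statistic and surfaces at the top of the ranking, pointing the operator at the likely failure.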
Contribution

We designed and implemented three on-line algorithms that use HTTP access logs to detect anomalies in user access patterns to web applications. These anomalies are used to detect failures in the application and to localize the most anomalous pages. We