Copyright 2007

Automating Change Detection and Impact Analysis

Savita Angadi, Vijayanand Banahatti, Subhrojyoti Chaudhury, Ashutosh Chauhan, Harshal Hayatnagarkar, Sumit Mittal, Rakesh R, Prashant Ramakumar, Navneet Rao and Harrick Vin

Systems Research Lab (SRL)
Tata Research Development and Design Centre, Pune
Tata Consultancy Services
July 2007

E-mails of contact authors: s.angadi@tcs.com; harrick.vin@tcs.com

ABSTRACT

IT plants operated by large enterprises evolve continuously to accommodate changes in infrastructure, application and workload. While the impact of some of these changes on system parameters, such as application performance and availability, is instantaneous and perceptible, the impact of other changes can be gradual and subtle. Today, although many IT plants collect and archive data about application performance as well as infrastructure utilization, many still rely on manual processes that involve “eyeballing” measured data to detect interesting changes and measure their impact on system parameters. A ‘holy grail’ for many Chief Information Officers (CIOs) and IT plant operators who analyze this data is automatic change detection and impact analysis. In this paper, we describe the design and implementation of a system that takes the first steps toward this long-term objective. The solution was designed and implemented for a multi-national insurance provider with operations across many countries and continents.

1. INTRODUCTION

Most large, multi-national enterprises today run IT plants that support hundreds of applications running on thousands of servers and databases. These plants evolve continuously to accommodate changes in infrastructure (which include hardware, operating system (OS), and middleware upgrades), application, and workload.
While the impact of some of these changes on system parameters, such as application performance and availability, is instantaneous and perceptible, the impact of other changes can be gradual and subtle. Consequently, CIOs and IT plant operators want to stay informed about the following:

- Percentage by which the performance of an application has degraded during the past X months
- Percentage by which the performance of an application has improved or deteriorated as of a certain date
- Changes (gradual or sudden) in the performance of a group of applications over a period

A sudden change in system parameters is generally due to upgrades in infrastructure or application. Understanding the impact of such changes allows CIOs and IT plant operators to estimate the impact of similar changes in the future (and thereby enables proactive planning). Gradual and subtle changes in system parameters are often due to alterations in workload patterns (in terms of composition and volume of workload). By detecting these changes as early as possible, CIOs and IT plant operators can avoid outages and thereby improve overall system availability and performance.

Today, although many IT plants collect and archive data about application performance as well as infrastructure utilization (for servers and databases), many still rely on manual processes, which involve “eyeballing” measured data collected in Excel spreadsheets or other dashboards, to detect “interesting” changes and measure their impact on system parameters. Though such an approach can be effective for small IT plants, it does not scale as IT plants grow in size, diversity, and distribution. Thus, a ‘holy grail’ for many CIOs and IT plant operators is to automate the process of change detection and impact analysis. However, this objective is hard to achieve because of the sheer volume of data that IT plants generate.
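To make the first two questions in the list above concrete, here is a minimal sketch of the underlying arithmetic. It is not taken from the system described in this paper; the function names `months_before` and `percent_change`, the use of monthly means, and the sample layout of (date, value) pairs are all assumptions chosen for illustration.

```python
from datetime import date
from statistics import mean


def months_before(day: date, months: int) -> date:
    """Return the first day of the month that lies `months` months before `day`."""
    index = day.year * 12 + (day.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


def percent_change(samples, as_of: date, months: int) -> float:
    """Percentage change of a metric between two calendar months.

    `samples` is a list of (date, value) pairs, for example the daily mean
    response time of one application (hypothetical layout). The monthly mean
    for the month of `as_of` is compared with the monthly mean `months`
    months earlier.
    """
    start = months_before(as_of, months)
    baseline = [v for d, v in samples if (d.year, d.month) == (start.year, start.month)]
    recent = [v for d, v in samples if (d.year, d.month) == (as_of.year, as_of.month)]
    if not baseline or not recent:
        raise ValueError("no samples in one of the two months being compared")
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    return 100.0 * (mean(recent) - base) / base
```

For a metric such as response time, a positive result indicates degradation and a negative result indicates improvement; for a metric such as throughput the interpretation is reversed.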
Automating change detection is similar to “finding needles in a haystack”.

This paper describes the design and implementation of a system for solving this problem. The solution involves: efficient organization of metrics in on-line analytical processing (OLAP) cubes to facilitate easy extraction of time-series data; statistical and machine learning algorithms for finding trends and patterns in time-series data; and automatic generation of dashboards to present the system analytics.

Section 2 of this paper describes the problem setup in the context of a multi-national insurance provider. Section 3 describes the overall system architecture and the analysis methodology for achieving automation. Section 4 provides details of the solution prototype implementation as well as results of some of our analysis. Finally, Section 5 summarizes our contributions.
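The distinction drawn in this introduction between sudden and gradual changes in a time series can be illustrated with a small heuristic. The sketch below is not the algorithm used by the system described in this paper; it is a simple stand-in, and the function names, the window size, and the thresholds `min_change` and `sudden_share` are all assumptions. It compares the mean of adjacent windows to find the largest level shift, then checks whether that single shift accounts for most of the overall change.

```python
from statistics import mean


def largest_level_shift(values, window):
    """Find the split point where the mean of the next `window` points differs
    most from the mean of the previous `window` points.

    Returns (index, shift), where `shift` is next-window mean minus
    previous-window mean at that index.
    """
    if len(values) < 2 * window:
        raise ValueError("need at least 2 * window points")
    best_index, best_shift = window, 0.0
    for i in range(window, len(values) - window + 1):
        shift = mean(values[i:i + window]) - mean(values[i - window:i])
        if abs(shift) > abs(best_shift):
            best_index, best_shift = i, shift
    return best_index, best_shift


def classify_change(values, window=7, min_change=0.10, sudden_share=0.8):
    """Label a series as 'none', 'sudden', or 'gradual' (illustrative heuristic)."""
    first = mean(values[:window])
    last = mean(values[-window:])
    if first == 0:
        raise ValueError("initial window mean is zero; relative change is undefined")
    total = last - first
    # Ignore series whose overall change is small relative to the starting level.
    if abs(total) / abs(first) < min_change:
        return "none"
    _, shift = largest_level_shift(values, window)
    # A step concentrates most of the total change at one split point;
    # a drift spreads it across the whole series.
    if abs(shift) >= sudden_share * abs(total):
        return "sudden"
    return "gradual"
```

With these assumed defaults, a series that jumps from 100 to 130 halfway through is labeled "sudden", while a series that climbs by one unit per sample from 100 to 129 is labeled "gradual". Real measurements are noisy, so a production approach would need statistical tests rather than fixed thresholds.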