Using Visualization to Support Network and Application Management in a Data Center

Danyel Fisher, David A. Maltz, Albert Greenberg, Xiaoyu Wang, Heather Warncke, George Robertson, and Mary Czerwinski

Abstract—We argue that there is a continuum between completely manual and completely automated management of networks and distributed applications. The ability to visualize the status of the network and applications inside a data center allows human operators to rapidly assess the health of the system, quickly identifying problems that span components and solving problems that challenge fully autonomic and machine-learning-based management systems. We present a case study that highlights the requirements on visualization systems designed for the management of networks and applications inside large data centers. We explain the design of the Visual-I system that meets these requirements and report the experience of an operations team using Visual-I.

Index Terms—Data center, operations, dynamic configuration

I. INTRODUCTION

The dream of the fully autonomic data center, one whose configuration is expressed declaratively and which is maintained and healed automatically, remains elusive. For any but the simplest data centers, human operators must periodically intervene: writing maintenance scripts, controlling updates, and modifying systems. Existing tools, such as those springing from a recovery-oriented computing perspective, have had positive impact [10] but fall short. Self-repair capabilities have blind spots (e.g., heisenbugs) and encounter errors that they cannot control [13]. Experienced data center operators have found that any autonomic system has the potential to spin out of control; for example, half of all upgrades fail.
Today, when situations arise that autonomic control systems do not handle, such as unexpected faults, application bugs, or emergent behaviors, data center operators are typically forced to revert to the simplest tools: command-line interfaces, point-and-click interfaces that generate long lists of servers, and the like. These primitive tools force humans to manually track and correlate many details, tasks they are not good at performing. We argue that there is a continuum between fully automated systems and fully manual systems, and that visualization tools can span the gap between them. For example, visualization tools help operators understand the current state of the system, even when the system is in an inconsistent or unusual state owing to an ongoing upgrade. Visualization can help operators discover correlated behaviors that are critical to debugging the system. Finally, visualization tools help deal with inconsistencies: autonomic management tools have difficulty when the structure of the application does not fit the assumptions of the management system. A visualization tool allows operators to see the information they need and then direct the system correctly, without requiring changes to the assumptions of the management system.

In this paper, we present a case study of a data center service that is managed via partial automation. We describe the tools that this service uses, and highlight the challenges facing a visualization system for managing data center services. We then briefly discuss Visual-I, our visualization framework that overcomes these challenges. Finally, we report how Visual-I was applied to the data center application considered, and initial operator experiences with Visual-I. Our goal was not to build a tool customized for one particular service; rather, it was to build a general tool for data center operations applicable to any cloud service (i.e., a data center application that provides services to users across the Internet).
Working with a specific team allowed us to understand how that team creates, maintains, and uses a mental model of their server and network topology. Our results thus far provide a concrete data point on the state of the art in network and distributed system management inside data centers.

II. PROBLEM DOMAIN

Implementing a cloud service involves constructing a single coherent front-end interface out of a complex cluster of routers, switches, servers, and databases. Today, these distributed systems can be composed of on the order of 10,000 physical servers. Given the scale and complexity of the hardware and software, under normal operating conditions hundreds of hardware or software components may be in various degraded states: failing, undergoing upgrade, or failed. The larger cloud services build in replication and resilience to cope with these conditions; nevertheless, things can and do go wrong; for example,

Xiaoyu Wang is with the University of North Carolina, Charlotte. This work was performed while the author was an intern at Microsoft Research. E-mail: xwang25@uncc.edu
Other authors are with Microsoft Research. E-mail: {danyelf, dmaltz, albert, heathwar, ggr, marycz}@microsoft.com
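To make the scale described in Section II concrete, the following minimal sketch aggregates per-component health states across a fleet of roughly 10,000 servers. It is purely illustrative and is not taken from the Visual-I system; the state names, server names, and the `health_summary` helper are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical component states; real monitoring systems track many more.
STATES = ("healthy", "failing", "upgrading", "failed")

@dataclass
class Component:
    name: str
    state: str  # one of STATES

def health_summary(components):
    """Count components per state, the kind of fleet-wide rollup an
    operator dashboard might display instead of a raw server list."""
    return Counter(c.state for c in components)

# A toy fleet: ~10,000 servers, a handful in degraded states.
fleet = [Component(f"srv{i:05d}", "healthy") for i in range(9990)]
fleet += [
    Component("srv09990", "failing"),
    Component("srv09991", "upgrading"),
    Component("srv09992", "failed"),
]

summary = health_summary(fleet)
print(summary)  # Counter({'healthy': 9990, 'failing': 1, 'upgrading': 1, 'failed': 1})
```

Even this trivial rollup illustrates the point of the paper: at this scale, a handful of degraded components is the normal case, and an operator needs an aggregated view rather than a 10,000-line listing to spot which few components deserve attention.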