Automated Simulation-Based Capacity Planning for Enterprise Data Fabrics

Samuel Kounev, Konstantin Bender, Fabian Brosig and Nikolaus Huber
Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany
{kounev,fabian.brosig,nikolaus.huber}@kit.edu

Russell Okamoto
VMware, Inc., 1260 NW Waterhouse Ave., Suite 200, Beaverton, Oregon 97006
russell@vmware.com

ABSTRACT

Enterprise data fabrics are gaining increasing attention in many industry domains, including financial services, telecommunications, transportation, and health care. Providing a distributed, operational data platform that sits between application infrastructures and back-end data sources, enterprise data fabrics are designed for high performance and scalability. However, given the dynamics of modern applications, system sizing and capacity planning need to be done continuously during operation to ensure adequate quality of service and efficient resource utilization. While most products ship with performance monitoring and analysis tools, such tools typically focus on low-level profiling and lack support for performance prediction and capacity planning. In this paper, we present a case study of a representative enterprise data fabric, the GemFire EDF, and a simulation-based tool that we have developed for automated performance prediction and capacity planning. The tool, called Jewel, automates resource demand estimation, performance model generation, performance model analysis, and results processing. We present an experimental evaluation of the tool demonstrating its effectiveness and practical applicability.

1. INTRODUCTION

Enterprise data fabrics are becoming increasingly popular as an enabling technology for modern high-performance, data-intensive applications.
____________________
This work was partially funded by the German Research Foundation (DFG) under grant No. KO 3445/6-1.
Copyright 2010 ACM ...$10.00.
____________________

Conceptually, an enterprise data fabric (EDF) represents a distributed enterprise middleware that leverages main memory and disk across multiple disparate hardware nodes to store, analyze, distribute, and replicate data. EDFs are a relatively new type of enterprise middleware. Also referred to as "information fabrics" or "data grids", EDFs can dramatically improve application performance and scalability. However, given the dynamics of modern applications, capacity planning needs to be done on a regular basis during operation to ensure adequate quality of service and efficient resource utilization. In today's data centers, EDFs are typically hosted on dedicated server machines with over-provisioned capacity to guarantee adequate performance and availability at peak usage times. Servers in data centers today typically run at average utilizations of 5-20% [26, 29], which corresponds to their lowest energy-efficiency region [3]. To improve energy efficiency and reduce total cost of ownership (TCO), capacity planning techniques are needed that help estimate the amount of resources required to provide a given quality-of-service level at a reasonable cost. Such techniques should help answer the following questions, which arise frequently both at system deployment time and during operation: How many servers are needed to meet given response time and throughput service level agreements (SLAs)?
How scalable is the EDF across N server nodes for a given workload type? What would be the average throughput and response time in steady state? What would be the utilization of the server nodes and of the network? What would be the performance bottleneck for a given scenario? How much would the system performance improve if the servers were upgraded? As the workload increases, how many additional servers would be needed? Answering such questions requires the ability to predict the system performance for a given workload scenario and capacity allocation. While many modeling and prediction techniques for distributed systems exist in the literature, most of them suffer from two significant drawbacks [17]. First, performance models are expensive to build and provide limited support for reusability and customization. Second, performance models are static, and maintaining them manually during operation is prohibitively expensive. Unfortunately, building a predictive performance model manually requires a lot of time and effort. Current performance analysis tools used in industry mostly focus on profiling and on monitoring transaction response times and resource consumption. Such tools often provide large amounts of low-level data, while important information needed for building performance models is missing, e.g., service resource demands.
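To make the notion of a service resource demand concrete: under the standard operational-analysis assumptions, the demand a request places on a resource can be derived from routine utilization and throughput measurements via the Service Demand Law, D = U / X. The sketch below is a minimal illustration of that relationship; it is not part of the Jewel tool, and the function name and measurement figures are hypothetical.

```python
def service_demand(utilization, throughput):
    """Estimate the service demand via the Service Demand Law: D = U / X.

    utilization -- average resource utilization over the measurement
                   interval, as a fraction between 0.0 and 1.0
    throughput  -- completed requests per second over the same interval
    Returns the average time (in seconds) each request consumes at the
    resource.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0.0, 1.0]")
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return utilization / throughput

# Hypothetical measurement: a node's CPU is 60% utilized while the node
# completes 300 requests/s, so each request demands 0.60 / 300 = 2 ms
# of CPU time.
demand = service_demand(0.60, 300.0)
print(f"{demand * 1000:.1f} ms per request")  # -> 2.0 ms per request
```

The appeal of this relationship for tools like the one described here is that utilization and throughput are exactly the kind of data routine monitoring already collects, so demands can be re-estimated continuously during operation rather than measured in dedicated profiling runs.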