Machine Learning for On-Line Hardware Reconfiguration

Jonathan Wildstrom*, Peter Stone, Emmett Witchel, Mike Dahlin
Department of Computer Sciences
The University of Texas at Austin
{jwildstr,pstone,witchel,dahlin}@cs.utexas.edu

Abstract

As computer systems continue to increase in complexity, the need for AI-based solutions is becoming more urgent. For example, high-end servers that can be partitioned into logical subsystems and repartitioned on the fly are now becoming available. This development raises the possibility of reconfiguring distributed systems online to optimize for dynamically changing workloads. However, it also introduces the need to decide when and how to reconfigure. This paper presents one approach to solving this online reconfiguration problem. In particular, we learn to identify, from only low-level system statistics, which of a set of possible configurations will lead to better performance under the current unknown workload. This approach requires no instrumentation of the system’s middleware or operating systems. We introduce an agent that is able to learn this model and use it to switch configurations online as the workload varies. Our agent is fully implemented and tested on a publicly available multi-machine, multi-process distributed system (the online transaction processing benchmark TPC-W). We demonstrate that our adaptive configuration is able to outperform any single fixed configuration in the set over a variety of workloads, including gradual changes and abrupt workload spikes.

1 Introduction

The recent introduction of partitionable servers has enabled the potential for adaptive hardware reconfiguration. Processors and memory can be added to or removed from a system while incurring no downtime, and even a single processor can be shared between separate logical systems.
While this allows for more flexibility in the operation of these systems, it also raises the questions of when and how the system should be reconfigured. This paper establishes that automated adaptive hardware reconfiguration can significantly improve overall system performance when workloads vary.

* Currently employed by IBM Systems and Storage Group. Any opinions expressed in this paper may not necessarily be the opinions of IBM.

Previous research [Wildstrom et al., 2005] has shown that the potential exists for performance to be improved through autonomous reconfiguration of CPU and memory resources. Specifically, that work showed that no single configuration is optimal for all workloads and introduced an approach to learning, based on low-level system statistics, which configuration is most effective for the current workload, without directly observing the workload. Although this work indicated that online reconfiguration should, in theory, improve performance, to the best of our knowledge it has not yet been established that online hardware reconfiguration actually produces a significant improvement in overall performance in practice.

This paper demonstrates increased performance for a transaction processing system using a learned model that reconfigures the system hardware online. Specifically, we train a robust model of the expected performance of different hardware configurations, and then use this model to guide an online reconfiguration agent. We show that this agent is able to make a significant improvement in performance when tested with a variety of workloads, as compared to static configurations.

The remainder of this paper is organized as follows. The next section gives an overview of our experimental testbed. Section 3 details our methodology in handling unexpected workload changes, including the training of our agent (Section 3.1) and the experiments used to test the agent (Section 3.2).
Section 4 contains the results of our experiments and some discussion of their implications. Section 5 gives an overview of related work, and Section 6 concludes.

2 Experimental Testbed and System Overview

Servers that support partitioning into multiple logical subsystems (partitions) are now commercially available [Quintero et al., 2004]. Each partition has independent memory and processors available, enabling it to function as if it were an independent physical machine. In this way, partitions (and applications running on separate partitions) are prevented from interfering with each other through resource contention.

Furthermore, these servers are highly flexible, both allowing different quantities of memory and processing resources to be assigned to partitions and supporting the addition and removal of resources while the operating system continues running. Hardware is also available that allows partitioning on the sub-processor level; e.g., a partition can use as little as 1/10 of a physical processor on the hosting system [Quintero
IJCAI-07 1113