Machine Learning for On-Line Hardware Reconfiguration

Jonathan Wildstrom*, Peter Stone, Emmett Witchel, Mike Dahlin
Department of Computer Sciences
The University of Texas at Austin
{jwildstr,pstone,witchel,dahlin}@cs.utexas.edu

Abstract

As computer systems continue to increase in complexity, the need for AI-based solutions is becoming more urgent. For example, high-end servers that can be partitioned into logical subsystems and repartitioned on the fly are now becoming available. This development raises the possibility of reconfiguring distributed systems online to optimize for dynamically changing workloads. However, it also introduces the need to decide when and how to reconfigure. This paper presents one approach to solving this online reconfiguration problem. In particular, we learn to identify, from only low-level system statistics, which of a set of possible configurations will lead to better performance under the current unknown workload. This approach requires no instrumentation of the system's middleware or operating systems. We introduce an agent that is able to learn this model and use it to switch configurations online as the workload varies. Our agent is fully implemented and tested on a publicly available multi-machine, multi-process distributed system (the online transaction processing benchmark TPC-W). We demonstrate that our adaptive configuration is able to outperform any single fixed configuration in the set over a variety of workloads, including gradual changes and abrupt workload spikes.

1 Introduction

The recent introduction of partitionable servers has enabled the potential for adaptive hardware reconfiguration. Processors and memory can be added to or removed from a system while incurring no downtime, and even single processors can be shared between separate logical systems.
While this allows for more flexibility in the operation of these systems, it also raises the questions of when and how the system should be reconfigured. This paper establishes that automated adaptive hardware reconfiguration can significantly improve overall system performance when workloads vary.

* Currently employed by IBM Systems and Storage Group. Any opinions expressed in this paper may not necessarily be the opinions of IBM.

Previous research [Wildstrom et al., 2005] has shown that the potential exists for performance to be improved through autonomous reconfiguration of CPU and memory resources. Specifically, that work showed that no single configuration is optimal for all workloads and introduced an approach to learning, based on low-level system statistics, which configuration is most effective for the current workload, without directly observing the workload. Although that work indicated that online reconfiguration should, in theory, improve performance, to the best of our knowledge it has not yet been established that online hardware reconfiguration actually produces a significant improvement in overall performance in practice.

This paper demonstrates increased performance for a transaction processing system using a learned model that reconfigures the system hardware online. Specifically, we train a robust model of the expected performance of different hardware configurations, and then use this model to guide an online reconfiguration agent. We show that this agent is able to make a significant improvement in performance when tested with a variety of workloads, as compared to static configurations.

The remainder of this paper is organized as follows. The next section gives an overview of our experimental testbed. Section 3 details our methodology in handling unexpected workload changes, including the training of our agent (Section 3.1) and the experiments used to test the agent (Section 3.2).
Section 4 contains the results of our experiments and some discussion of their implications. Section 5 gives an overview of related work, and Section 6 concludes.

2 Experimental Testbed and System Overview

Servers that support partitioning into multiple logical subsystems (partitions) are now commercially available [Quintero et al., 2004]. Each partition has independent memory and processors available, enabling it to function as if it were an independent physical machine. In this way, partitions (and applications running on separate partitions) are prevented from interfering with each other through resource contention.

Furthermore, these servers are highly flexible: they allow different quantities of memory and processing resources to be assigned to partitions, and they support the addition and removal of resources while the operating system continues running. Hardware is also available that allows partitioning on the sub-processor level; e.g., a partition can use as little as 1/10 of a physical processor on the hosting system [Quintero
IJCAI-07 1113