Mitigating System Noise With Simultaneous Multi-Threading*

Eli Rosenthal
Brown University
eli rosenthal@brown.edu

Edgar A. León, Adam T. Moody
Lawrence Livermore National Laboratory
{leon,moody20}@llnl.gov

* This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-ABS-641767.

ABSTRACT

System noise on commodity clusters significantly impacts the scalability of scientific applications. As the number of nodes in a cluster grows, this impact outweighs the benefits of the increased system capability. In this work, we propose using Simultaneous Multi-Threading (SMT) to mitigate system noise, and we demonstrate its effectiveness using Fixed Work Quantum (FWQ) micro-benchmarks. This technique relies on using the floating-point and integer units of a core (shared SMT resources) for system processing when an application is stalled waiting for an out-of-core operation. Unlike core specialization, which dedicates a core to system processing, our approach allows an application to use all cores of a node and proves significantly more effective in reducing system noise.

1. INTRODUCTION

The root cause of performance degradation for many scientific applications at scale stems from interference with system processes that run on each compute node. Because the application has little control over when and where these system processes run, the interference appears as noise to the application. Noise impacts a large number of scientific applications, including global climate modeling, heart and organ simulations, visualization of mixing fluids, and detailed predictions of ecosystems.

Significant research and development over the last ten years [2] have virtually eliminated noise on leadership capability systems like Sequoia at Lawrence Livermore National Laboratory (LLNL). These gains, however, come with high costs associated with the research, development, and maintenance of specialized hardware and a radically simplified operating system (OS). Commodity systems, on the other hand, use inexpensive hardware and software designed for mass consumer markets. Since mass consumers do not commonly build large-scale systems, designs for this hardware and software may disregard performance problems that arise at scale. Noise consumes intolerable amounts of execution time on today's commodity systems, and it will reach critical proportions in next-generation clusters if left unchecked.

2. IDENTIFYING NOISE

The effect of system noise on application performance depends on the schedule at which system processes execute across nodes [1]. The scalability of applications may improve if these processes are co-scheduled at the same time across the cluster. Thus, we first study whether system processes occur synchronously or asynchronously on a production Linux cluster at LLNL (cab). Compute nodes comprise two Intel Sandy Bridge sockets, with 8 cores per socket, and are interconnected with an InfiniBand QDR fabric. We use FWQ to quantify the impact of noise within each node and implement a global synchronous clock to correlate noise events across nodes. FWQ executes a fixed amount of work for a specified number of iterations and records the runtime (in processor cycles) of each iteration. This benchmark shows areas where the number of cycles to complete the same amount of work is much higher, thereby indicating the impact of noise.
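To make the measurement concrete, the following is a minimal FWQ-style loop in C. It is a sketch rather than the LLNL benchmark itself: the work kernel, the iteration and work counts, and the use of the x86 time-stamp counter (__rdtsc) as the cycle counter are illustrative assumptions.

    /* Minimal FWQ-style sketch (not the LLNL benchmark): executes a fixed
     * work quantum for a fixed number of iterations and records the cost of
     * each iteration in processor cycles. */
    #include <inttypes.h>
    #include <stdio.h>
    #include <x86intrin.h>                 /* __rdtsc() on x86 */

    #define ITERS 100000                   /* iterations to record         */
    #define WORK  100000L                  /* loop trips per work quantum  */

    static volatile uint64_t sink;         /* keeps the work from being elided */

    static void do_fixed_work(void)
    {
        uint64_t acc = 0;
        for (long i = 0; i < WORK; i++)
            acc += (uint64_t)i * i;        /* fixed amount of integer work */
        sink = acc;
    }

    int main(void)
    {
        static uint64_t cycles[ITERS];

        for (long i = 0; i < ITERS; i++) {
            uint64_t start = __rdtsc();
            do_fixed_work();
            cycles[i] = __rdtsc() - start; /* per-iteration cost in cycles */
        }

        /* Iterations that cost far more cycles than the minimum expose
         * time stolen from the application by system noise. */
        for (long i = 0; i < ITERS; i++)
            printf("%ld %" PRIu64 "\n", i, cycles[i]);
        return 0;
    }

Aligning these per-iteration traces across nodes with a synchronized clock produces time histograms like those in Figures 1 and 2.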
Our global clock is based on Jones' synchronization algorithm [1]. We use cab for all of our experiments in this work.

Figure 1 shows a histogram of the duration of each FWQ iteration across time on 4 nodes, with brighter colors indicating more populated bins. The median duration of a unit of work on a single core is about 1.02 ms. The spikes in this figure indicate a large amount of synchronous noise. At 256 nodes, noise is not synchronous, as shown by Figure 2; areas of increased noise are broader and less distinct. Thus, without significant changes to the Linux kernel to co-schedule services (assuming they can be co-scheduled), system processes execute asynchronously across nodes, which increases the effect of noise on application performance.

Figure 1: Work duration per iteration across time on 4 nodes. Well-defined spikes show areas of synchronous noise.

3. MITIGATING NOISE WITH SMT

Known methods of mitigating noise include dedicating one core per node for system tasks (core specialization). In this work, we propose using one thread of execution per core for processing system tasks on SMT architectures. SMT masks latencies of out-of-core operations. Thus, while an application's threads are stalled on a core, a different thread can process system tasks using the floating-point and integer