Partitioning Low-diameter Networks to Eliminate Inter-job Interference

Nikhil Jain, Abhinav Bhatele, Xiang Ni, Todd Gamblin, Laxmikant V. Kale

Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, California 94551 USA
IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598 USA
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801 USA
E-mail: {nikhil, bhatele, tgamblin}@llnl.gov, xiang.ni@ibm.com, kale@illinois.edu

Abstract: On most supercomputers, except some torus-network-based systems, resource managers allocate nodes to jobs without considering the sharing of network resources by different jobs. Such network-oblivious resource allocations result in link sharing among multiple jobs that can cause significant performance variability and performance degradation for individual jobs. In this paper, we explore low-diameter networks and corresponding node allocation policies that can eliminate inter-job interference. We propose a variation of n-dimensional mesh networks called express mesh. An express mesh is denser than the corresponding mesh network, has a low diameter independent of the number of routers, and is easily partitionable. We compare structural properties and performance of express mesh with other popular low-diameter networks. We present practical node allocation policies for express mesh and fat-tree networks that not only eliminate inter-job interference and performance variability, but also improve overall performance.

Keywords: network topology; partitionability; inter-job interference; express mesh; simulation

I. INTRODUCTION

Computational power of high performance computing (HPC) systems has been increasing at a fast rate for several years. This has led to network resources becoming a major performance bottleneck when executing applications at extreme scales.
Low-diameter, high-radix networks such as fat-tree (FT) [1], dragonfly (DF) [2], [3], and Slim Fly (SF) [4] are being explored to cope with the scarcity of network resources.

Most resource managers allocate nodes to jobs using network-oblivious schemes that maximize job throughput and system utilization [5]. A major side-effect of such allocation schemes is that while compute resources are dedicated to individual jobs, network resources such as routers and links are shared among multiple jobs. Further, non-minimal adaptive routing in low-diameter network (LDN) topologies such as DF and SF increases this sharing of resources and makes it harder to allocate network resources exclusively to individual jobs.

Figure 1: Run-to-run application performance variability in current HPC systems [6], [7]

Sharing of network resources among multiple jobs increases network congestion and results in inter-job interference. Recent studies have shown that inter-job interference causes significant variation and degradation in the observed performance of applications [6], [8]. Figure 1 presents one such example in which a production application is run several times on a 5D-torus-based system (Mira) and a dragonfly-based system (Edison). On an LDN such as DF in which network resources are shared (not partitioned), up to 2× performance variability is observed. Performance variability makes run-to-run comparisons of different executions difficult and hampers the process of optimizing code performance. In contrast, an easily partitionable torus network provides consistent performance. In this paper, partitionability refers to a property of the network that facilitates network-aware allocation of nodes to jobs with the goal of minimizing link sharing among jobs or partitions.
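The notion of link sharing among jobs can be made concrete with a small sketch. The Python code below is an illustration only, not the allocation policy or simulator used in this paper. It assumes a 4×4 2D mesh with one node per router, dimension-ordered (X then Y) routing, and all-to-all traffic within each job. The helper names `xy_route` and `shared_links` are hypothetical. It counts the directed links used by more than one job for a contiguous sub-mesh allocation and for an interleaved (network-oblivious) allocation.

```python
from itertools import permutations

def xy_route(src, dst):
    """Directed links on a dimension-ordered path: first along X, then along Y."""
    links = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def shared_links(allocation):
    """Return the set of directed links used by more than one job.

    allocation maps a job name to its list of (x, y) nodes; every ordered
    pair of nodes within a job is assumed to communicate (all-to-all).
    """
    users = {}
    for job, nodes in allocation.items():
        for src, dst in permutations(nodes, 2):
            for link in xy_route(src, dst):
                users.setdefault(link, set()).add(job)
    return {link for link, jobs in users.items() if len(jobs) > 1}

# Two 8-node jobs on a 4x4 mesh (sizes chosen only for illustration).
contiguous = {
    "A": [(x, y) for x in range(2) for y in range(4)],     # left 2x4 sub-mesh
    "B": [(x, y) for x in range(2, 4) for y in range(4)],  # right 2x4 sub-mesh
}
interleaved = {
    "A": [(x, y) for x in range(4) for y in range(4) if (x + y) % 2 == 0],
    "B": [(x, y) for x in range(4) for y in range(4) if (x + y) % 2 == 1],
}

print("shared links, contiguous :", len(shared_links(contiguous)))
print("shared links, interleaved:", len(shared_links(interleaved)))
```

Under these assumptions, each rectangular sub-mesh contains every dimension-ordered path between its own nodes, so the contiguous allocation shares no links, whereas the interleaved allocation forces both jobs onto common links.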
We address the problem of inter-job interference by attempting to answer the following: can a combination of network topology (existing or new design) and node allocation policy eliminate inter-job interference without losing the performance achievable on shared low-diameter networks? To address this challenge, we study the partitionability of three well-known LDN topologies: DF, FT, and SF.

Mesh and torus networks result in lower performance than LDNs because of their large diameter. Even when high-dimensional meshes and tori are used, the diameter increases rapidly with node count, while the bisection bandwidth does not increase as fast as that on LDNs such as DF and SF. However, mesh and torus networks can be partitioned easily to provide isolated allocations to each job. This results in predictable performance on these systems, as demonstrated by the results on Mira in Figure 1. Driven by these observations, we explore variations to mesh networks that can reduce network diameter and improve performance, while retaining the ability to provide interference-free node allocations to individual jobs.

2017 IEEE International Parallel and Distributed Processing Symposium. 1530-2075/17 $31.00 © 2017 IEEE. DOI 10.1109/IPDPS.2017.91

The main contributions of this paper are: