Partitioning Low-diameter Networks to Eliminate Inter-job Interference
Nikhil Jain∗, Abhinav Bhatele∗, Xiang Ni†, Todd Gamblin∗, Laxmikant V. Kale‡
∗Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, California 94551 USA
†IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598 USA
‡Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801 USA
E-mail: ∗{nikhil, bhatele, tgamblin}@llnl.gov, †xiang.ni@ibm.com, ‡kale@illinois.edu
Abstract—On most supercomputers, except some torus net-
work based systems, resource managers allocate nodes to jobs
without considering the sharing of network resources by different
jobs. Such network-oblivious resource allocations result in link
sharing among multiple jobs that can cause significant perfor-
mance variability and performance degradation for individual
jobs. In this paper, we explore low-diameter networks and
corresponding node allocation policies that can eliminate inter-
job interference. We propose a variation to n-dimensional mesh
networks called express mesh. An express mesh is denser than the
corresponding mesh network, has a low diameter independent of
the number of routers, and is easily partitionable. We compare
structural properties and performance of express mesh with
other popular low-diameter networks. We present practical node
allocation policies for express mesh and fat-tree networks that not
only eliminate inter-job interference and performance variability,
but also improve overall performance.
Keywords-network topology; partitionability; inter-job interfer-
ence; express mesh; simulation;
I. INTRODUCTION
Computational power of high performance computing (HPC)
systems has been increasing at a fast rate for several years. This
has led to network resources becoming a major performance
bottleneck when executing applications at extreme scales.
Low-diameter, high-radix networks such as fat-tree (FT) [1],
dragonfly (DF) [2], [3], and Slim Fly (SF) [4], are being
explored to cope with the scarcity of network resources.
Most resource managers allocate nodes to jobs using
network-oblivious schemes that maximize job throughput and system
utilization [5]. A major side-effect of such allocation schemes
is that while compute resources are dedicated to individual
jobs, network resources such as routers and links are shared
among multiple jobs. Further, non-minimal adaptive routing
in low-diameter network (LDN) topologies such as DF and
SF increases this sharing of resources and makes it harder to
allocate network resources exclusively to individual jobs.
Sharing of network resources among multiple jobs increases
network congestion and results in inter-job interference. Recent
studies have shown that inter-job interference causes significant
variation and degradation in observed performance of applica-
tions [6], [8]. Figure 1 presents one such example in which a
production application is run on a 5D torus based system (Mira)
and a dragonfly-based system (Edison) several times. On an
LDN such as DF in which network resources are shared (not
partitioned), up to 2× performance variability is observed.
Performance variability makes run-to-run comparisons of different
executions difficult and hampers the process of optimizing code
performance. In contrast, an easily partitionable torus network
provides consistent performance. In this paper, partitionability
refers to a property of the network that facilitates
network-aware allocation of nodes to jobs with a goal of minimizing
link sharing among jobs or partitions.
Figure 1: Run-to-run application performance variability in current HPC systems [6], [7]
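To make the notion of link sharing concrete, the following minimal sketch counts the links that two jobs have in common on a small 2D mesh. The 4×4 mesh size, the static dimension-ordered (X then Y) routing, the all-pairs traffic within each job, and all function names are illustrative assumptions, not the configuration studied in this paper. The sketch compares a contiguous (partitioned) allocation with an interleaved, network-oblivious one.

```python
from itertools import permutations

def xy_route(src, dst):
    """Directed links used by X-then-Y routing between two mesh nodes (assumed routing)."""
    links = []
    x, y = src
    while x != dst[0]:                      # travel along X first
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:                      # then travel along Y
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def links_used(nodes):
    """All directed links touched by all-pairs traffic among a job's nodes."""
    used = set()
    for src, dst in permutations(nodes, 2):
        used.update(xy_route(src, dst))
    return used

def shared_links(job_a, job_b):
    """Links that carry traffic of both jobs."""
    return links_used(job_a) & links_used(job_b)

SIZE = 4  # hypothetical 4x4 mesh
all_nodes = [(x, y) for x in range(SIZE) for y in range(SIZE)]

# Contiguous allocation: each job owns half of the columns.
block_a = [n for n in all_nodes if n[0] < SIZE // 2]
block_b = [n for n in all_nodes if n[0] >= SIZE // 2]

# Interleaved allocation: the jobs alternate columns.
mixed_a = [n for n in all_nodes if n[0] % 2 == 0]
mixed_b = [n for n in all_nodes if n[0] % 2 == 1]

print("shared links, contiguous: ", len(shared_links(block_a, block_b)))
print("shared links, interleaved:", len(shared_links(mixed_a, mixed_b)))
```

Under these assumptions the contiguous allocation keeps each job's traffic inside its own columns, while the interleaved allocation forces both jobs through the same horizontal links. Adaptive or non-minimal routing is not modeled here.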
We address the problem of inter-job interference by attempt-
ing to answer the following: can a combination of network
topology (existing or new design) and node allocation policy
eliminate inter-job interference without losing the performance
achievable on shared low-diameter networks? To address this
challenge, we study the partitionability of three well-known
LDN topologies – DF, FT, and SF.
Mesh and torus networks result in lower performance as
compared to LDNs because of their large diameter. Even when
high-dimensional meshes and tori are used, the diameter in-
creases rapidly with node count, while the bisection bandwidth
does not increase as fast as that on LDNs such as DF and SF.
However, mesh and torus networks can be partitioned easily
to provide isolated allocations to each job. This results in
predictable performance on these systems as demonstrated by
the results on Mira in Figure 1. Driven by these observations,
we explore variations to mesh networks that can reduce network
diameter and improve performance, while retaining the ability
to provide interference-free node allocations to individual jobs.
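The growth of diameter with node count can be illustrated with the standard hop-count formulas for regular meshes and tori: a mesh with k routers along a dimension contributes k − 1 hops to the diameter, and a torus contributes ⌊k/2⌋. The sketch below is only an illustration; the function names and the choice of 5D shapes are assumptions made here, not values taken from the paper.

```python
def router_count(dims):
    """Number of routers in a mesh or torus with the given per-dimension sizes."""
    total = 1
    for k in dims:
        total *= k
    return total

def mesh_diameter(dims):
    """Longest shortest path in a mesh: corner to opposite corner."""
    return sum(k - 1 for k in dims)

def torus_diameter(dims):
    """Longest shortest path in a torus: wraparound links halve each dimension."""
    return sum(k // 2 for k in dims)

# Hypothetical 5D shapes with k routers along every dimension.
for k in (4, 8, 16):
    dims = (k,) * 5
    print(f"k={k:2d} routers={router_count(dims):8d} "
          f"mesh diameter={mesh_diameter(dims):3d} "
          f"torus diameter={torus_diameter(dims):3d}")
```

Even with five dimensions, the diameter in this sketch keeps growing as the system is scaled up along each dimension, which is the behavior the paragraph above contrasts with low-diameter networks.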
The main contributions of this paper are:
2017 IEEE International Parallel and Distributed Processing Symposium
1530-2075/17 $31.00 © 2017 IEEE
DOI 10.1109/IPDPS.2017.91