A Communication Benchmark Tailored to Intel
Broadwell Nodes and Tuned to the DEAC Cluster
Riana J. Freedman
Department of Computer Science
Wake Forest University
Winston-Salem, USA
riana.j.freedman@alumni.wfu.edu
Damian Valles
Ingram School of Engineering
Texas State University
San Marcos, USA
dvalles@txstate.edu
Abstract—Various benchmarks exist for assessing performance
characteristics of commodity hardware utilized in high perfor-
mance computing (HPC) cluster environments. An additional
assessment of bandwidth for Intel’s 44-core processors considers
network saturation via message passing focused on indirectly
connected cores. The benchmark developed in this work provides
a means of measuring bandwidth of two Broadwell nodes when
communicating inter-chassis with all cores of each node utilized.
This benchmark was developed in three phases using Message
Passing Interface (MPI). First, the bandwidth measure was
tested for MPI Send operations in point-to-point communication.
Second, the merge sort algorithm was implemented as a means
of assessing bandwidth and a tree-structured communication
algorithm was developed and implemented to maximize inter-
node communication and minimize intra-node communication.
Third, the benchmark was tuned to the configuration of the Dis-
tributed Environment for Academic Computing (DEAC) Cluster.
A second benchmark, which removes local computation from the merge
sort algorithm, effectively performs a gather operation in a tree
structure. These benchmarks were tested and compared with
relevant Intel MPI Benchmarks (IMB) tests. In the developed
benchmarks, the nodes requested significantly more bandwidth
than in the existing MPI Gather operation and IMB benchmarks
due to the large size of the messages simultaneously placed on
the network.
Index Terms—HPC, network, benchmark, merge sort, gather
I. INTRODUCTION
Given the wealth of benchmarks available that may be
applied to commodity hardware to assess performance char-
acteristics, an additional assessment of bandwidth for Intel’s
44-core processors considers network saturation via message
passing focused on indirectly connected cores. A benchmark
was therefore developed to test the bandwidth of communi-
cation (measured in megabytes (MB) per second) between
two Broadwell nodes in a high performance computing (HPC)
environment. This benchmark considers scenarios in which
messages of varied sizes are distributed over varied numbers of
cores in the two nodes such that the network is saturated. To
facilitate the development and assessment of this approach
to benchmarking Broadwell nodes, three phases were imple-
mented. First, the communication environment was prepared
and tested. Second, a variation of the merge sort algorithm
was implemented as a means of varying message size over
different numbers of cores to obtain bandwidth measures.
Third, the merge sort algorithm was tuned to the Wake For-
est University (WFU) Distributed Environment for Academic
Computing (DEAC) Cluster to identify potentially indirectly
connected cores. A tree-structured gather algorithm was also
implemented in place of merge sort. This second benchmark
provides a means of comparison given that the amount of local
computation is reduced.
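As a rough illustration of the tree-structured idea described above, the following Python sketch builds a pairing schedule for two nodes in which each sender is matched with a receiver on the other node whenever possible. It is not the implementation used in this work: the rank layout (ranks 0 to 43 on one node, 44 to 87 on the other), the alternation of the receiving node, and the function names are all assumptions made for illustration.

```python
def cross_node_tree_schedule(cores_per_node=44):
    """Build a tree-structured gather schedule for two nodes.

    Returns a list of levels; each level is a list of (sender, receiver)
    rank pairs. Assumed layout (hypothetical): ranks 0..n-1 live on node A
    and ranks n..2n-1 live on node B.
    """
    node_a = list(range(cores_per_node))
    node_b = list(range(cores_per_node, 2 * cores_per_node))
    levels = []
    while len(node_a) + len(node_b) > 1:
        pairs = []
        next_a, next_b = [], []
        crossing = min(len(node_a), len(node_b))
        # Pair active ranks across the two nodes, alternating which node
        # receives so that both nodes keep roughly half of the data.
        for i in range(crossing):
            a, b = node_a[i], node_b[i]
            if i % 2 == 0:
                pairs.append((b, a))
                next_a.append(a)
            else:
                pairs.append((a, b))
                next_b.append(b)
        # Ranks without a cross-node partner fall back to intra-node pairs;
        # a single unpaired rank simply stays active for the next level.
        for leftover, survivors in ((node_a[crossing:], next_a),
                                    (node_b[crossing:], next_b)):
            for j in range(0, len(leftover) - 1, 2):
                pairs.append((leftover[j + 1], leftover[j]))
                survivors.append(leftover[j])
            if len(leftover) % 2 == 1:
                survivors.append(leftover[-1])
        levels.append(pairs)
        node_a, node_b = next_a, next_b
    return levels


def is_cross_node(pair, cores_per_node=44):
    """True when sender and receiver sit on different nodes (assumed layout)."""
    sender, receiver = pair
    return (sender < cores_per_node) != (receiver < cores_per_node)
```

Under these assumptions, 44 cores per node gives 7 levels and 87 sends in total, nearly all of which cross between the two nodes. A real benchmark would attach MPI send and receive calls to each pair and, for the merge sort variant, merge the received data before the next level.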
Bandwidth is a measure of the amount of data that can
be sent over the network in a specified unit of time [1].
As such, bandwidth provides a performance metric for the
network. Given that a benchmark stresses the limits of a
system to determine its maximum performance, this work seeks
to maximize the bandwidth obtained in the presented
benchmark. Benchmarks available for Message Passing Interface
(MPI) often focus on point-to-point communication to provide
bandwidth measurements, such as in the PingPong and PingPing
benchmarks [2]. Much research has examined the performance
metrics provided by the ping-pong benchmark, among others, to
assess bandwidth in a system (for example, see [3], [4]).
Research has also addressed the implementation of sorting
algorithms in other benchmarks for performance assessment.
These benchmarks focus on a variety of related aspects of the
process, such as memory access in light of architectural
design [5], [6] and speedup [7]. The sorting algorithms
implemented vary as well. Implementations
of scatter and gather algorithms for optimization of memory
access [8] and for optimization of communication [9] have
been addressed. Optimization problems of this nature tend to
minimize the measure addressed. The current work instead
seeks to maximize bandwidth by tuning the implementation to
the Broadwell node hardware.
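To make the bandwidth measure concrete, the short Python sketch below computes MB per second from a message size and an elapsed time, with a ping-pong variant that estimates the one-way time as half of a measured round trip. The function names are hypothetical, and treating 1 MB as 10^6 bytes is an assumption; the text above does not specify which convention is used.

```python
def bandwidth_mb_per_s(message_bytes, elapsed_seconds, messages=1):
    """Data volume moved per unit time, in MB/s (assuming 1 MB = 10**6 bytes)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return messages * message_bytes / elapsed_seconds / 1e6


def pingpong_bandwidth_mb_per_s(message_bytes, round_trip_seconds):
    """Ping-pong style estimate: one-way time taken as half the round trip."""
    return bandwidth_mb_per_s(message_bytes, round_trip_seconds / 2)
```

For instance, `bandwidth_mb_per_s(4_000_000, 0.5)` evaluates to 8.0, and passing `messages=44` models 44 equally sized messages placed on the network during the same interval. The timing values here are illustrative only and are not measurements from this work.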
II. PRELIMINARY WORK
The set-up of the benchmark presented was completed in
two phases prior to tuning the measure to the Broadwell
hardware and to the DEAC Cluster. A distributed-memory
system is assumed given the configuration of the DEAC
Cluster.
A. Phase One: Point-to-Point Communication
With 44 cores available on each of two nodes, the initial
implementation required half of the available cores to send
978-1-5386-4649-6/18/$31.00 ©2018 IEEE 502