A Communication Benchmark Tailored to Intel Broadwell Nodes and Tuned to the DEAC Cluster

Riana J. Freedman
Department of Computer Science
Wake Forest University
Winston-Salem, USA
riana.j.freedman@alumni.wfu.edu

Damian Valles
Ingram School of Engineering
Texas State University
San Marcos, USA
dvalles@txstate.edu

Abstract—Various benchmarks exist for assessing performance characteristics of commodity hardware utilized in high performance computing (HPC) cluster environments. An additional assessment of bandwidth for Intel’s 44-core processors considers network saturation via message passing focused on indirectly connected cores. The benchmark developed in this work provides a means of measuring bandwidth of two Broadwell nodes when communicating inter-chassis with all cores of each node utilized. This benchmark was developed in three phases using Message Passing Interface (MPI). First, the bandwidth measure was tested for MPI Send operations in point-to-point communication. Second, the merge sort algorithm was implemented as a means of assessing bandwidth and a tree-structured communication algorithm was developed and implemented to maximize inter-node communication and minimize intra-node communication. Third, the benchmark was tuned to the configuration of the Distributed Environment for Academic Computing (DEAC) Cluster. A second benchmark removing local computation from the merge sort algorithm effectively performs a gather operation in a tree-structure. These benchmarks were tested and compared with relevant Intel MPI Benchmarks (IMB) tests. In the developed benchmarks, the nodes requested significantly more bandwidth than in the existing MPI Gather operation and IMB benchmarks due to the large size of the messages simultaneously placed on the network.

Index Terms—HPC, network, benchmark, merge sort, gather

I. INTRODUCTION

Given the wealth of benchmarks available that may be applied to commodity hardware to assess performance characteristics, an additional assessment of bandwidth for Intel’s 44-core processors considers network saturation via message passing focused on indirectly connected cores. A benchmark was therefore developed to test the bandwidth of communication (measured in megabytes (MB) per second) between two Broadwell nodes in a high performance computing (HPC) environment. This benchmark considers scenarios when varied message sizes are distributed over varied numbers of cores in the two nodes such that the network is saturated.

To facilitate the development and assessment of this approach to benchmarking Broadwell nodes, three phases were implemented. First, the communication environment was prepared and tested. Second, a variation of the merge sort algorithm was implemented as a means of varying message size over different numbers of cores to obtain bandwidth measures. Third, the merge sort algorithm was tuned to the Wake Forest University (WFU) Distributed Environment for Academic Computing (DEAC) Cluster to identify potentially indirectly connected cores. A tree-structured gather algorithm was also implemented in place of merge sort. This second benchmark provides a means of comparison given that the amount of local computation is reduced.

Bandwidth is a measure of the amount of data that can be sent over the network in a specified unit of time [1]. As such, bandwidth provides a performance metric for the network. Given that a benchmark seeks to stress the limits of a system to determine its maximum performance, this work seeks to maximize bandwidth obtained in the presented benchmark. Benchmarks available for Message Passing Interface (MPI) often focus on point-to-point communication to provide bandwidth measurements, such as in the PingPong and PingPing benchmarks [2].
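The definition of bandwidth given above can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the measurement code used in this work: the function names are hypothetical, 1 MB is assumed to be 10^6 bytes, and the one-way transfer time in a ping-pong exchange is assumed to be half of the measured round-trip time.

```python
# Minimal sketch: bandwidth as data volume divided by elapsed time.
# Assumption (not from the paper): 1 MB = 10**6 bytes.
BYTES_PER_MB = 1_000_000


def bandwidth_mb_per_s(message_bytes: int, elapsed_seconds: float) -> float:
    """Return bandwidth in MB/s for a transfer of message_bytes."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return message_bytes / BYTES_PER_MB / elapsed_seconds


def pingpong_bandwidth_mb_per_s(message_bytes: int, round_trip_seconds: float) -> float:
    """Estimate one-way bandwidth from a ping-pong round trip.

    Assumption: the message travels once in each direction, so the
    one-way time is taken as half of the round-trip time.
    """
    return bandwidth_mb_per_s(message_bytes, round_trip_seconds / 2.0)


if __name__ == "__main__":
    # Hypothetical numbers chosen only to exercise the functions.
    print(bandwidth_mb_per_s(4_000_000, 2.0))
    print(pingpong_bandwidth_mb_per_s(1_000_000, 0.002))
```

In a real measurement the elapsed time would come from a wall-clock timer placed around the communication calls, and the result depends on whether a megabyte is taken as 10^6 or 2^20 bytes, so the convention should be stated alongside any reported figure.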
Much research has been done into the performance metrics provided by the ping pong benchmark, among others, to assess bandwidth in a system (for example, see [3], [4]). Research has also been performed regarding the implementation of sorting algorithms in other benchmarks for performance assessment. These benchmarks focus on a variety of related aspects of the process, such as memory access considering architectural design [5], [6] and speedup [7]. The sorting algorithms implemented vary, as well. Implementations of scatter and gather algorithms for optimization of memory access [8] and for optimization of communication [9] have been addressed. Optimization problems of this nature tend to seek minimization of the measure addressed. The current work seeks to optimize via maximization of bandwidth by tuning the implementation to the Broadwell node hardware.

978-1-5386-4649-6/18/$31.00 ©2018 IEEE

II. PRELIMINARY WORK

The set-up of the benchmark presented was completed in two phases prior to tuning the measure to the Broadwell hardware and to the DEAC Cluster. A distributed-memory system is assumed given the configuration of the DEAC Cluster.

A. Phase One: Point-to-Point Communication

With 44 cores available on each of two nodes, the initial implementation required half of the available cores to send