Efficient Barrier and Allreduce on InfiniBand Clusters using Hardware Multicast and Adaptive Algorithms

Amith R Mamidala, Jiuxing Liu, Dhabaleswar K Panda
Dept. of Computer Science and Engineering
The Ohio State University
Columbus, OH 43210
{mamidala, liuj, panda}@cse.ohio-state.edu

This research is supported in part by Department of Energy's Grant #DE-FC02-01ER25506, a grant from Sandia National Laboratory, a grant from Los Alamos National Laboratory, and National Science Foundation's grants #CCR-0204429 and #CCR-0311542.

Abstract

Popular algorithms proposed in the literature for Barrier and Allreduce in clusters, such as pair-wise exchange, dissemination, and gather-broadcast, do not give optimal performance when there is skew among the nodes in the cluster. In pair-wise exchange and dissemination, all the nodes must arrive before each step can complete. The gather-broadcast algorithm assumes a fixed tree topology. In this paper, we propose to use the hardware multicast feature of InfiniBand in the design of an adaptive algorithm that performs well in the presence of skew. In this approach, the topology of the tree is not fixed but adapts depending on the skew: if the skew is sufficiently large, the last arriving node becomes the root of the tree.

We have carried out an in-depth evaluation of our scheme, using synchronization delay as the performance metric for Barrier and Allreduce in the presence of skew. Our performance evaluation shows that our design scales very well with system size. Our designs can reduce the synchronization delay by a factor of 2.28 for Barrier and by a factor of 2.18 in the case of Allreduce. We have examined different skew scenarios and show that the adaptive design performs either better than or comparably to the existing schemes.

1. Introduction

Clusters built from commodity PCs are increasingly being used in the high performance computing arena because they are very cost-effective and affordable. The Message Passing Interface (MPI) [11] programming model has become the de-facto standard for developing parallel applications that deliver high performance. MPI provides both point-to-point and collective communication functions. Many applications take advantage of these collective operations: applications such as IS and FT in the NAS Parallel Benchmarks suite [9] use these collectives almost exclusively for communication. Thus, providing high-performance and scalable collective communication support is critical for many cluster systems.

Most current network interconnects provide features to support efficient collective communication. Recently, InfiniBand has been emerging as a powerful interconnect technology. One of the notable features of InfiniBand is that it supports hardware multicast; using this feature, a message can be sent to several nodes in an efficient manner. InfiniBand also has other important features such as Remote Direct Memory Access (RDMA) operations. We can exploit these features to provide efficient and scalable collective operations over InfiniBand clusters.

In this paper, we focus on two important collective operations, MPI_Barrier and MPI_Allreduce. MPI_Barrier is used as a synchronization call: every process that has called the barrier blocks until all the participating processes have called this operation. In MPI_Allreduce, each process supplies a vector of a certain length which is fixed across all the processes. All these vectors are reduced to a single vector using the operator provided in the collective call, and each process receives the resulting vector. Also note that there is an implicit synchronization of all the nodes participating in the Allreduce.
Many algorithms have been proposed in the literature for Barrier and Allreduce [4]. The most popular ones are pair-wise exchange, dissemination, and gather-broadcast. An implementation of these algorithms over InfiniBand is discussed in [3]. As InfiniBand clusters are becoming increasingly large, one important factor to be