Optimized Barriers for Heterogeneous Systems Using MPI

Jan C. Meyer and Anne C. Elster
Norwegian University of Science and Technology
Dept. of Computer and Information Science
Sem Sælands v. 7-9, NO-7491 Trondheim, Norway
{janchris,elster}@idi.ntnu.no

Abstract: The heterogeneous communication characteristics of clustered SMP systems create great potential for optimizations which favor physical locality. This paper describes a novel technique for automating such optimizations, applied to barrier operations. Portability poses a challenge when optimizing for locality, as costs are bound to variations in platform topology. This challenge is addressed by representing both platform structure and barrier algorithms as input data, and altering the algorithm based on benchmark results which can be easily obtained from a given platform. The resulting optimization technique is empirically tested on two modern clusters, using up to eight dual quad-core nodes on one, and up to ten dual hex-core nodes on the other. The included test results show that the method captures performance advantages on both systems without any explicit customization, and produces specialized barriers that outperform a topology-neutral implementation.

Keywords: topology; adaptive; barrier; MPI

I. INTRODUCTION

The increasing number of cores in parallel architectures introduces complexity in the interconnect, forcing designs to consider which levels of fast local memory should be private and which shared, how to maintain coherency, and how to implement communication with remote systems. This presents parallel applications with the challenges of a heterogeneous infrastructure for communication. Recalling the historical development of cache-coherent NUMA machines at larger scale, a growing core count coupled with a mixture of private and shared cache memories suggests a similar tendency towards shared resources with non-uniform access times even at the chip level.
Our previous work on mutual exclusion [12] with both ccNUMA interconnects and multithreaded processors suggests that variations in signal latency caused by the physical locality of threads become significant already at modest scales, making locality an important factor to control for efficient synchronization.

This paper explores a method for automatically constructing signaling patterns which form cost-efficient barrier operations, in a scenario of controlled process locality and highly variable point-to-point signal costs. Specifically, barriers are represented as incidence matrices of layered dependency graphs, which are coupled to matrices of point-to-point signal costs in order to derive a heuristic function for overall cost. This representation allows the internals of the barrier operation to be automatically specialized to the underlying topology, in ways that a handwritten approach could match only by relying on the topological details of the target platform. Testing the resulting, generated algorithms on clusters of dual quad-core and dual hex-core nodes shows favorable performance compared to the common tree algorithm, even though that algorithm already strongly favors neighborhood locality.

The rest of the paper is structured as follows: Section II describes related work. Section III provides a high-level outline of our method, before Sections IV and V detail its topological and algorithmic aspects, respectively. Section VI shows that the combined model can predict algorithm and topology interactions well enough to guide optimization.
Section VII proposes an automatic method to generate customized barrier algorithms for a profiled platform, and shows results with superior performance and scalability to the provided
2011 IEEE International Parallel & Distributed Processing Symposium 1530-2075/11 $26.00 © 2011 IEEE DOI 10.1109/IPDPS.2011.124 20