Optimized Barriers for Heterogeneous Systems Using MPI
Jan C. Meyer and Anne C. Elster
Norwegian University of Science and Technology
Dept. of Computer and Information Science
Sem Sælands v. 7-9, NO-7491 Trondheim, Norway
{janchris,elster}@idi.ntnu.no
Abstract—The heterogeneous communication char-
acteristics of clustered SMP systems create great
potential for optimizations which favor physical lo-
cality. This paper describes a novel technique for
automating such optimizations, applied to barrier
operations. Portability poses a challenge when op-
timizing for locality, as costs are bound to variations
in platform topology. This challenge is addressed
through representing both platform structure and
barrier algorithms as input data, and altering the
algorithm based on benchmark results which can be
easily obtained from a given platform. The resulting
optimization technique is empirically tested on two
modern clusters, using up to eight dual quad-core nodes on
one and up to ten dual hex-core nodes on the other.
Included test results show that the method captures
performance advantages on both systems without
any explicit customization, and produces specialized
barriers of superior performance to a topology-
neutral implementation.
Keywords-topology; adaptive; barrier; MPI
I. INTRODUCTION
The increasing number of cores in parallel
architectures introduces complexity in the
interconnect, forcing designs to consider which levels of
fast local memory should be private and which shared,
how to maintain coherency, and how to implement
communication with remote systems. This
presents parallel applications with the challenges
of a heterogeneous infrastructure for communica-
tion. Recalling the historical development of cache-
coherent NUMA machines at larger scale, a grow-
ing core count coupled with a mixture of private
and shared cache memories suggests a similar ten-
dency towards shared resources with non-uniform
access times even at the chip level. Our previous
work on mutual exclusion [12] with both ccNUMA
interconnects and multithreaded processors sug-
gests that variations in signal latency caused by
the physical locality of threads become significant
already at modest scales, making it an important
factor to control for efficient synchronization.
This paper explores a method for automati-
cally constructing signaling patterns which form
cost-efficient barrier operations, in a scenario of
controlled process locality, and highly variable
point-to-point signal costs. Specifically, barriers
are represented as incidence matrices of layered
dependency graphs, which are coupled to matrices
of point-to-point signal costs in order to derive
a heuristic function for overall cost. This repre-
sentation allows the internals of barrier operation
to be automatically specialized to the underlying
topology in ways which would require a handwrit-
ten approach to rely on the topological details of
the target platform. Tests of the generated
algorithms on clusters of dual quad-core and dual
hex-core nodes show favorable performance compared
to the common tree algorithm, even though
that algorithm already strongly favors
neighborhood locality.
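As a rough illustration of this representation, the Python sketch below encodes a barrier as a list of stage matrices, where entry [a][b] of a stage is 1 if process a signals process b in that stage, and pairs it with a matrix of point-to-point signal costs. The function names (`is_barrier`, `estimated_cost`, `two_node_cost`), the example cost values, and the cost estimate itself (the sum over stages of the most expensive signal in each stage) are assumptions made for this sketch. They are not the heuristic developed in the later sections.

```python
def two_node_cost(n, per_node, local=1.0, remote=10.0):
    """Hypothetical cost matrix: cheap within a node, expensive across nodes."""
    return [[0.0 if i == j else
             (local if i // per_node == j // per_node else remote)
             for j in range(n)] for i in range(n)]


def empty_stage(n):
    """An n-by-n stage matrix with no signals."""
    return [[0] * n for _ in range(n)]


def linear_stages(n):
    """Topology-neutral barrier: everyone signals rank 0, rank 0 releases all."""
    gather, release = empty_stage(n), empty_stage(n)
    for i in range(1, n):
        gather[i][0] = 1
        release[0][i] = 1
    return [gather, release]


def hierarchical_stages(n, per_node):
    """Assumed locality-aware barrier: gather per node, exchange among
    node leaders, then release per node."""
    leaders = list(range(0, n, per_node))
    gather, exchange, release = empty_stage(n), empty_stage(n), empty_stage(n)
    for i in range(n):
        leader = (i // per_node) * per_node
        if i != leader:
            gather[i][leader] = 1
            release[leader][i] = 1
    for a in leaders:
        for b in leaders:
            if a != b:
                exchange[a][b] = 1
    return [gather, exchange, release]


def is_barrier(stages, n):
    """True if, after all stages, every process has (transitively)
    heard from every other process."""
    # knows[i][j]: process j has learned that process i arrived.
    knows = [[i == j for j in range(n)] for i in range(n)]
    for stage in stages:
        nxt = [row[:] for row in knows]
        for i in range(n):
            for a in range(n):
                if knows[i][a]:
                    for b in range(n):
                        if stage[a][b]:
                            nxt[i][b] = True
        knows = nxt
    return all(all(row) for row in knows)


def estimated_cost(stages, cost):
    """Assumed heuristic: each stage costs as much as its slowest signal."""
    n = len(cost)
    total = 0.0
    for stage in stages:
        total += max((cost[a][b] for a in range(n) for b in range(n)
                      if stage[a][b]), default=0.0)
    return total
```

With four processes placed two per node, both patterns pass the `is_barrier` check, but under the assumed costs the hierarchical pattern pays for only one remote stage, while each of the two linear stages contains a remote signal. Actual costs depend on the measured signal latencies of the platform.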
The rest of the paper is structured as follows:
Section II describes related work. Section III pro-
vides a high-level outline of our method, before
Sections IV and V detail its topological and al-
gorithmic aspects, respectively. Section VI shows
that the combined model can predict algorithm
and topology interactions well enough to guide
optimization. Section VII proposes an automatic
method to generate customized barrier algorithms
for a profiled platform, and shows results with su-
perior performance and scalability to the provided
2011 IEEE International Parallel & Distributed Processing Symposium
1530-2075/11 $26.00 © 2011 IEEE
DOI 10.1109/IPDPS.2011.124
20