Efficient Broadcast on Computational Grids

Gabriel Mateescu (a) and Ryan Taylor (b)

(a) Information Management Services Branch, National Research Council Canada, 100 Sussex Drive, Ottawa ON K1A 0R6, Canada. E-mail: gabriel.mateescu@nrc.gc.ca
(b) School of Computer Science, Carleton University, 1125 Colonel By Drive, Ottawa, K1S 5B6, Canada. E-mail: rtaylor@scs.carleton.ca

Collective communication operations such as broadcast, gather and reduce are potential performance bottlenecks for scientific computing software. With the advent of wide-area distributed computing and computational Grids, achieving efficient collective communication for applications running on geographically distributed computers becomes of paramount importance. We propose an efficient broadcast algorithm that is applicable to any connected network, and show that it improves the performance of the MPI broadcast operation.

1 Introduction

Parallel and distributed scientific computing applications employ various forms of collective communication, such as broadcast, reduce, scatter, and gather. Collective communication is a potential performance bottleneck, especially when the communication occurs over shared networks with large latencies.
For computational Grids, achieving efficient collective communication for applications spanning geographically distributed computers is of paramount importance. We propose an efficient broadcast algorithm that is applicable to any connected network, and show that its performance is competitive with that of the MPICH implementation [1,2] of the Message Passing Interface [3].

We consider a set of networked computer resources and represent the network as a directed graph G = (V, E), where V is the set of vertices and E is the set of edges. Vertices represent computer nodes and edges represent the interconnect. With each edge (u, v) ∈ E we associate the weight w(u, v) > 0, which represents the latency of the communication from vertex u to vertex v. A message of size m is to be sent from a designated root to all the other vertices such that: (i) for each edge e = (u, v) ∈ E, node u takes the time Δ(u) > 0, called the injection time, to inject the message for delivery to v; (ii) a sender vertex can inject only one message at a time, i.e., communication is point-to-point. The graph is strongly connected, i.e., for each pair of vertices u, v ∈ V there is a path from u to v and from v to u.

The problem is to find the broadcast schedule that minimizes the time at which all the vertices have the message, where the broadcast schedule specifies the vertices that each vertex sends messages to and the order in which the sends are performed. Notice that the problem contains undirected graphs as a special case. The problem can be shown to be NP-complete [4,5]. We present here an approximate solution.

2 Related Work

The published literature on collective communication for wide-area networks [6,7] typically follows the approach of dividing the network into a hierarchy of levels and defining collective operations in terms of collective operations within the levels. Often, a simplified view of the wide-area network is assumed, in which point-to-point latencies are equal.
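As an illustration of the cost model defined in the Introduction, which allows a different latency on every edge, the following Python sketch computes the time at which a given broadcast schedule completes. It rests on one reading of that model, stated here as an assumption: the k-th message sent by a vertex u finishes injection k·Δ(u) after u has obtained the message, and then needs w(u, v) to reach v. The function names and the two six-vertex, unit-latency schedules are hypothetical and chosen only for illustration.

```python
from typing import Dict, Hashable, List, Tuple

Vertex = Hashable


def broadcast_completion_time(
    root: Vertex,
    schedule: Dict[Vertex, List[Vertex]],
    latency: Dict[Tuple[Vertex, Vertex], float],
    injection: Dict[Vertex, float],
) -> Tuple[float, Dict[Vertex, float]]:
    """Return (completion time, arrival time of the message at each vertex).

    schedule[u] lists the vertices that u sends to, in send order.
    Assumption: the k-th send of u (k = 1, 2, ...) arrives at
    arrival[u] + k * injection[u] + latency[(u, v)].
    """
    arrival = {root: 0.0}  # the root holds the message at time 0
    stack = [root]
    while stack:
        u = stack.pop()
        for k, v in enumerate(schedule.get(u, []), start=1):
            if v in arrival:
                raise ValueError(f"vertex {v!r} receives the message twice")
            arrival[v] = arrival[u] + k * injection[u] + latency[(u, v)]
            stack.append(v)
    return max(arrival.values()), arrival


def unit_latency(schedule: Dict[Vertex, List[Vertex]]) -> Dict[Tuple[Vertex, Vertex], float]:
    """Give every edge used by the schedule a latency of 1 (illustrative only)."""
    return {(u, v): 1.0 for u, kids in schedule.items() for v in kids}


# Two hypothetical schedules on six vertices, root = 0.
star = {0: [1, 2, 3, 4, 5]}                  # the root sends every message itself
two_level = {0: [1, 2], 1: [3, 4], 2: [5]}   # the root delegates part of the work

delta = 1.0                                  # same injection time at every vertex
inj = {v: delta for v in range(6)}

t_star, _ = broadcast_completion_time(0, star, unit_latency(star), inj)
t_tree, _ = broadcast_completion_time(0, two_level, unit_latency(two_level), inj)
print(t_star, t_tree)  # 6.0 5.0
```

Under these assumptions the star completes at 1 + 5Δ and the two-level schedule at 2 + 3Δ, so with Δ = 1 delegating sends is faster even though every vertex in the star is one hop from the root. With a small injection time such as Δ = 0.25 the ordering reverses.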
An approach that supports general latencies has been proposed by Mandal, Kennedy and Mellor-Crummey [5]. The authors consider the special case when Δ is a constant, and employ Dijkstra's algorithm together with a broadcast schedule derived from node labeling. Our approach has some points in common with the method proposed in [5]. Unlike our method, however, the authors build the communication tree by applying the original Dijkstra's algorithm, which does not account for the effect of the injection time on the communication topology. This may produce a shortest path tree that is far from the optimal communication topology, as shown by the example in Fig. 1, where all the edges have the weight of one. The single-source shortest path tree (left graph; the shortest path edges are marked with arrows), determined assuming Δ = 0, has a longest path of length 1. However, if Δ is included, the length of the longest path is actually 1 + 5Δ, since all the messages are sent by vertex 0 and there is an injection delay of Δ for each message. The tree shown on the right side (the shortest path edges are marked with arrows) has the longest path 2 + 3Δ, which is better than 1 + 5Δ when Δ > 0.5. Moreover, in the labeling stage of the algorithm, [5] uses an expensive method for computing the