On Balancing Traffic Load in Path-Based Multicast Communication

A. Al-Dubai, M. Ould-Khaoua, K. El-Zayyat*, and L. M. Mackenzie
Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK
E-mail: {aldubai, mohamed, lewis}@dcs.gla.ac.uk
*School of Computer Science, Telecommunications and Information Systems, DePaul University, USA
E-mail: kelzayyat@cs.depaul.edu

Abstract

Multicast is the most primitive collective capability of any message-passing network. It is not only central to many important parallel applications in science and engineering but is also fundamental to the implementation of higher-level communication operations such as gossip, gather, and barrier synchronisation. This paper presents a new efficient path-based multicast algorithm, which can achieve a high degree of parallelism and low communication latency over a wide range of traffic loads in the mesh. To achieve this, the proposed algorithm relies on a new approach that divides the destinations in a way that balances the traffic load on network channels during the propagation of the multicast message. Results from extensive simulations under a variety of working conditions confirm that the proposed algorithm exhibits performance characteristics superior to those of some well-known existing algorithms, such as the dual-path, multiple-path, and column-path algorithms.

1. Introduction

Optimising the performance of message-passing multicomputers requires matching inter-processor communication algorithms and application characteristics to a suitable underlying interconnection network. The mesh has been one of the most popular interconnection networks for multicomputers owing to its many desirable properties, such as ease of implementation, recursive structure, and an ability to exploit the communication locality found in many parallel applications to reduce message latency [1, 2, 9, 15, 20, 21, 22].

The switching method determines the way messages visit intermediate nodes.
Wormhole switching has been widely used in practice for two reasons. First, its low buffering requirements allow for efficient router implementation. Second, and more importantly, it makes latency almost independent of the message distance in the absence of blocking. In wormhole switching, a message is divided into elementary units called flits, each of a few bytes, for transmission and flow control. The header flit (containing routing information) governs the route, and the remaining data flits follow it in a pipelined fashion. If a channel transmits the header of a message, it must transmit all the remaining flits of the same message before transmitting flits of another message. When the header is blocked, the data flits are blocked in situ.

Multicast, which refers to the delivery of a message from a given source to a group of destinations, is one of the most important collective communication operations. It is often required in scientific computations to distribute large data arrays over system nodes in order, for example, to perform various data manipulation operations. It is also required in control operations such as global synchronisation and signalling changes in network conditions, e.g., faults, as well as in applications such as image processing, matrix multiplication and graphics on parallel computers.

Multicast latency consists of three components: start-up latency, network latency and blocking latency [3, 4, 12, 14, 18]. The start-up latency is the time incurred by the operating system when preparing a message for injection into the network. The network latency consists of channel propagation and router delays, while blocking latency accounts for delays due to message contention over network resources, e.g., buffers and channels. In current generation machines, the start-up latency is the dominating factor in the cost of communication, being typically of the order of microseconds compared to the network latency, which is of the order of nanoseconds [18].
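The latency decomposition described above can be illustrated with a minimal sketch. It assumes a simple first-order timing model for contention-free wormhole switching, in which the header pays a per-hop cost and the remaining flits follow in a pipeline. The names `LatencyParams`, `wormhole_network_latency` and `total_latency`, as well as all numerical values, are hypothetical and chosen only to reflect the microsecond versus nanosecond orders of magnitude mentioned in the text; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class LatencyParams:
    """Illustrative timing parameters (assumed values, not measurements)."""
    startup_ns: float = 1000.0    # software start-up cost per message (about 1 microsecond)
    router_delay_ns: float = 2.0  # time for the header to traverse one router
    flit_time_ns: float = 1.0     # time to transfer one flit across one channel


def wormhole_network_latency(hops: int, message_flits: int, p: LatencyParams) -> float:
    """Contention-free network latency under the assumed wormhole model.

    The header flit pays the router and channel delay at every hop; the
    remaining flits follow in a pipeline, one flit time apart.
    """
    header_ns = hops * (p.router_delay_ns + p.flit_time_ns)
    body_ns = (message_flits - 1) * p.flit_time_ns
    return header_ns + body_ns


def total_latency(hops: int, message_flits: int, blocking_ns: float,
                  p: LatencyParams) -> float:
    """Sum of the three components: start-up, network and blocking latency."""
    return p.startup_ns + wormhole_network_latency(hops, message_flits, p) + blocking_ns


if __name__ == "__main__":
    params = LatencyParams()
    # Same 1024-flit message sent over a short and a long path, with no blocking.
    for hops in (2, 20):
        net = wormhole_network_latency(hops, 1024, params)
        tot = total_latency(hops, 1024, blocking_ns=0.0, p=params)
        print(f"hops={hops:2d}  network={net:.1f} ns  total={tot:.1f} ns")
```

Under these assumed values, a tenfold increase in path length changes the network latency of a 1024-flit message by only about 5%, which is the sense in which wormhole latency is almost independent of distance when no blocking occurs. The blocking term is passed in as a plain number here because, as the text goes on to note, it depends on the routing algorithm and the instantaneous traffic.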
Blocking latency, on the other hand, depends on the routing algorithm and the generated traffic, and consequently can vary widely depending on instantaneous traffic conditions. Due to the dominance of the start-up latency, much research work has been devoted to the

ISBN: 1-56555-269-5 533 SPECTS '03