Pipelined Round-Robin Broadcast Algorithm in Homogeneous Clusters of SMP

Shan Axida† Ta Quoc Viet‡ Tsutomu Yoshinaga†
†University of Electro-Communications Email: axida@sowa.is.uec.ac.jp, yosinaga@is.uec.ac.jp
‡GLOBAL CYBERSOFT Inc. Email: Ta.Quocviet@jp.yokogawa.com

1. Introduction

Homogeneous clusters of symmetric multiprocessor (SMP) workstations are widely used for high-performance computing. For such clusters to be effective, communication must be carried out as efficiently as possible. Data broadcast is one of the most common collective communication operations and one of the most essential tasks in parallel and distributed processing. In distributed matrix manipulations, such as the various algorithms for distributed matrix multiplication [1, 2, 3] or the distributed solution of linear equation systems [4], data broadcast over processor rows and columns occupies almost all of the communication time. Therefore, an effective broadcast algorithm can clearly improve overall performance.

The broadcast operation requires a message from the root node (sender) to reach all other nodes in the system by the end of the operation. In this work, we define execution time as the duration between the time when the root starts sending and the time when the last machine receives the entire message. The effectiveness of a broadcast algorithm is measured by its execution time.

We consider the case in which a large amount D of data is to be broadcast. If t is the time that a pair of nodes spends on sending and receiving D, then t is the theoretical limit on the execution time of the broadcast operation. By splitting the data into several fragments and distributing them to the destinations in a cyclic scheme, the root can complete its task within that limit t. At the same time, the destination nodes exchange the fragments that they have already received from the root by round-robin scheduling, so they can also finish their tasks at almost the same point in time as the root.
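The cyclic distribution performed by the root can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `cyclic_scatter_schedule`, the number of fragments, and the rule that fragment i goes to the i-th destination in cyclic order are all assumptions made for the example.

```python
def cyclic_scatter_schedule(num_nodes, num_fragments, root=0):
    """Return the root's send order as (fragment_index, destination) pairs.

    Assumption for illustration: fragments are handed out to the
    destinations one after another in cyclic order, skipping the root.
    """
    if num_nodes < 2:
        raise ValueError("need at least one destination")
    destinations = [n for n in range(num_nodes) if n != root]
    return [(i, destinations[i % len(destinations)])
            for i in range(num_fragments)]


# Hypothetical example: 4 nodes (root is node 0), data split into 6 fragments.
schedule = cyclic_scatter_schedule(num_nodes=4, num_fragments=6)
```

Each fragment appears exactly once in the schedule, so the root transmits D in total, which is consistent with it finishing in time t. The sketch does not model how the destinations exchange fragments among themselves.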
The aim is to ensure that all nodes are busy at every point in time and that all existing communication links are used during the entire process. As a result, our algorithm can come close to the theoretical limit t of the execution time.

The rest of the paper is organized as follows. Section 2 briefly introduces related work. Section 3 defines the network model on which we analyze our algorithm. Section 4 introduces the round-robin scheduling algorithm and its use in global data exchange. Section 5 explains the pipelined round-robin broadcast algorithm in detail. Section 6 presents our experimental results. Finally, Section 7 presents our conclusions and future work.

2. Related Work

We first consider broadcast among N_nodes nodes. The root sends large data of size D to all other N_nodes-1 nodes. One of the simplest broadcast algorithms is based on a linear tree topology. In this method, node 0 sends the data to node 1, node 1 sends the data to node 2, and so on. Since the send and receive operation is repeated N_nodes-1 times and each repetition takes time t, the total execution time of this algorithm is (N_nodes-1)×t, which is higher than that of other proposed algorithms [14]. However, each node (except for the root and last node) spends only

This study proposes a novel broadcast algorithm for large data over symmetric multiprocessor (SMP) clusters. The algorithm is based on round-robin scheduling and a pipelined data scattering pattern. It can exploit all available communication resources of the system at every point in time and is thereby capable of achieving approximately the theoretical limit of performance. This implies that, for a large data size on a network with any even number of nodes, the broadcast execution time is approximately the time required for one node to send the data to another node. We compare the performance of the algorithm with that of broadcast algorithms that are widely used in high-performance computing systems.
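To make the execution-time figures quoted in this section concrete, the sketch below evaluates the linear-tree cost (number of nodes minus one, multiplied by t) against the single-transfer time t that the proposed algorithm aims to approach. The function names and sample values are illustrative assumptions, and the model ignores latency and fragmentation overheads.

```python
def linear_tree_time(num_nodes, t):
    """Execution time of the linear-tree broadcast: (num_nodes - 1) * t."""
    if num_nodes < 2:
        raise ValueError("broadcast needs at least two nodes")
    return (num_nodes - 1) * t


def theoretical_limit(t):
    """Limit used in the text: one full point-to-point transfer of the data."""
    return t


# Hypothetical sample values: 8 nodes, t = 2.0 seconds per full transfer.
nodes, t = 8, 2.0
ratio = linear_tree_time(nodes, t) / theoretical_limit(t)
```

Under this simple model the gap between the linear tree and the limit grows linearly with the number of nodes, which is why a broadcast that stays near t is attractive for large clusters.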