Cluster Computing 3 (2000) 25–34

Block-cyclic redistribution over heterogeneous networks ∗

Prashanth B. Bhat a, Viktor K. Prasanna a and C.S. Raghavendra b

a Department of EE-Systems, University of Southern California, Los Angeles, CA 90089-2562, USA
b The Aerospace Corporation, Los Angeles, CA 90009, USA

Clusters of workstations and networked parallel computing systems are emerging as promising computational platforms for HPC applications. The processors in such systems are typically interconnected by a collection of heterogeneous networks such as Ethernet, ATM, and FDDI, among others. In this paper, we develop techniques to perform block-cyclic redistribution over P processors interconnected by such a collection of heterogeneous networks. We represent the communication scheduling problem using a timing diagram formalism. Here, each interprocessor communication event is represented by a rectangle whose height denotes the time to perform this event over the heterogeneous network. The communication scheduling problem is then one of appropriately positioning the rectangles so as to minimize the completion time of all the communication events. For the important case where the block size changes by a factor of K, we develop a heuristic algorithm whose completion time is at most twice the optimal. The running time of the heuristic is O(PK^2). Our heuristic algorithm is adaptive to variations in network performance, and derives schedules at run-time, based on current information about the available network bandwidth. Our experimental results show that our schedules always have communication times that are very close to optimal.

1. Introduction

Due to advances in high-speed networks, workstation clusters and loosely connected distributed systems are being used as platforms for High Performance Computing.
Wide area networking technology has also enabled the development of metacomputers [13], wherein grand challenge applications are parallelized across geographically distributed supercomputers and visualization devices. Such distributed systems are typically interconnected with a collection of many different kinds of communication networks, such as ATM, HiPPI, and Ethernet.

Prototype systems with such heterogeneous networks have been built. For example, Kim and Lilja [5,6] evaluated the performance of HPC applications on a cluster of workstations interconnected with ATM, Ethernet, and FDDI networks. The performance characteristics of each of the networks were first evaluated by sending messages of various sizes over the particular network. These characteristics were then used to choose a suitable technique for data communication. The Performance Based Path Selection (PBPS) technique selects one of the networks for a given communication event, depending on the size of the message. The Aggregation technique uses multiple networks at the same time, by breaking up the message into multiple parts and sending these parts over different networks.

The I-WAY (Information Wide Area Year) metacomputer at SC '95 [7] consisted of over 10 networks of varying bandwidths, protocols, and routing technology. The HiPer-D project investigates the use of networked distributed computing capabilities in battle management systems on U.S. Navy cruisers. The Battlefield Awareness and Data Dissemination (BADD) program develops techniques for delivering multimedia data to mobile troops over a combination of wired and wireless networks [14].

∗ A preliminary version of this paper appeared in the Proceedings of the 11th ISCA International Conference on Parallel and Distributed Computing (PDCS 1998).
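The two communication techniques attributed to Kim and Lilja above, PBPS and Aggregation, can be illustrated with a minimal sketch. It assumes a simple linear cost model (time = startup latency + message size / bandwidth). The model, the `Network` class, the function names, and all numeric values are hypothetical and chosen for illustration; they are not taken from the cited work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    """Hypothetical linear cost model: time = latency + bytes / bandwidth."""
    name: str
    latency_s: float
    bandwidth_bytes_per_s: float

    def transfer_time(self, nbytes: int) -> float:
        return self.latency_s + nbytes / self.bandwidth_bytes_per_s


def select_network(networks, nbytes):
    """PBPS-style choice: the single network with the smallest estimated time."""
    return min(networks, key=lambda net: net.transfer_time(nbytes))


def aggregate_split(networks, nbytes):
    """Aggregation-style split: bytes per network, proportional to bandwidth.

    Startup latency is ignored here for simplicity (an assumption of this sketch).
    """
    total_bw = sum(net.bandwidth_bytes_per_s for net in networks)
    parts = {net.name: int(nbytes * net.bandwidth_bytes_per_s / total_bw)
             for net in networks}
    # Assign any bytes lost to rounding down to the highest-bandwidth network.
    fastest = max(networks, key=lambda net: net.bandwidth_bytes_per_s)
    parts[fastest.name] += nbytes - sum(parts.values())
    return parts


# Illustrative values only, not measurements from the paper or its references.
networks = [
    Network("ethernet", latency_s=1e-4, bandwidth_bytes_per_s=1e6),
    Network("atm", latency_s=1e-3, bandwidth_bytes_per_s=1e7),
]
print(select_network(networks, 100).name)
print(select_network(networks, 1_000_000).name)
print(aggregate_split(networks, 1100))
```

Under this assumed model, a low-latency network tends to win for short messages and a high-bandwidth network for long ones, which is the size dependence that path selection exploits.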
From the above examples, it is clear that heterogeneity is a salient characteristic of the interconnection network in local area clusters and distributed computational environments. Further, the network is shared among multiple applications. The performance therefore depends upon the current traffic conditions, and typically varies over time.

For scalable performance on such a platform, support for fast application-level communication is necessary. Efficient implementations of important collective communication kernels must be incorporated into communication libraries. In this paper, we develop communication techniques for block-cyclic redistribution over such heterogeneous networks. We consider the important case where the block size changes by a factor of K. Our techniques can also be extended to other redistribution problems.

The block-cyclic distribution is widely used in many HPC applications to partition an array over multiple processors. For example, in signal processing applications, the block-cyclic distribution is the natural choice for radar and sonar data cubes. Many of the frequently occurring communication patterns, such as the corner turn operation, can then be viewed as block-cyclic redistribution operations [9]. ScaLAPACK, a widely used mathematical software library for dense linear algebra computations, also uses a block-cyclic distribution for good load balance and computational efficiency. Matrix transpose operations, which often occur in linear algebra computations, are a special case of the block-cyclic redistribution. HPF provides directives for specifying block-cyclic distribution and redistribution of arrays.

© Baltzer Science Publishers BV

The problem of block-cyclic redistribution in a tightly-coupled homogeneous parallel system has been well re-