Optimizing Collective Communications on the K-port Spidergon Network

Jiri Jaros
Dept. of Computer Systems, Faculty of Information Technology, BUT, Brno, Czech Republic
e-mail: jarosjir@fit.vutbr.cz

Vaclav Dvorak
Dept. of Computer Systems, Faculty of Information Technology, BUT, Brno, Czech Republic
e-mail: dvorak@fit.vutbr.cz

Abstract: The paper investigates the impact of using k ports in the direct communication model of collective communications on the overall performance of the Spidergon interconnection network. Since a higher number k of internal ports can improve performance but increases the cost of the interconnection network, the analysis presents the ideal performance-cost tradeoff for slim- and fat-node Spidergon networks.

Keywords: collective communications; k-port model; Spidergon; slim-nodes; fat-nodes; router usage; latency

I. INTRODUCTION

With an increasing number of processor cores, memory modules and other hardware units in the latest chips, the importance of communication among them, and of the related interconnection networks, is steadily growing. The memory of many-core systems is physically distributed among computing nodes that communicate by sending data through a Network on Chip (NoC) [1]. Communication operations can be either point-to-point, with one source and one destination, or collective, with more than two participating processes. Some embedded parallel applications, such as network or media processors, are characterized by independent data streams or by a small amount of inter-process communication [2]. However, many general-purpose parallel applications display bulk synchronous behavior: the processing nodes access the network according to a global, structured communication pattern. The performance of these collective communications (CCs for short) has a dramatic impact on the overall efficiency of parallel processing.
The most efficient way to switch messages through a network connecting multiple processing elements (PEs) is wormhole (WH) switching. Wormhole switching reduces the effect of path length on communication time, but if multiple messages exist in the network concurrently (as happens in CCs), contention for communication links may become a source of congestion and waiting times. To avoid congestion delays, CCs must be organized into separate time steps, with each step containing only pair-wise communications whose paths do not share any links. Contention-free scheduling of CCs is therefore important.

The port model of the system defines the number k of PE ports that can be engaged in communication simultaneously. This means that besides the 2d network channels, there are 2k internal unidirectional (DMA) channels, k input and k output, that connect each local processor core to its router and can transfer data simultaneously. It always holds that k ≤ d, where d is the node degree; the one-port model (k = 1) and the all-port router model (k = d) are used most frequently. Typically, a higher number of ports reduces communication overhead, but it also increases the complexity of the routers and duplicates network interfaces in the connected PEs.

In the most common one-port system, a PE has to transmit (and/or receive) messages sequentially, using only one local channel. Messages may block on an occupied injection channel even when the network channels they require are free. These systems are very easy to implement and are often used in computer clusters equipped with only one network interface. Architectures with multiple ports alleviate this bottleneck. In the all-port router architecture, there are as many local PE channels as network channels, which reduces the message blocking latency during CC operations. On the other hand, adding internal ports requires a more complex router and makes the system more expensive.
Such all-port routers can often be found in systems on a chip. Fig. 1 illustrates the differences between one-port and all-port switches.

Figure 1. Port models for 3-regular Spidergon network: (a) one-port model; (b) all-port model (the panels label the local CPU ports).

The k-port model is a generalization of these port models and has been widely used, e.g., in [3] and [4]. An appropriate number of internal ports can boost performance while keeping the router complexity at a reasonable level. One example of a successful k-port NoC implementation is presented in [5] and [6]. The authors investigate the speedup of broadcast communication inside the Cell Broadband Engine processor [7], [8] and show that using multiple ports (up to four) significantly reduces the broadcast latency of short messages. Unfortunately, other communication patterns were not addressed there.

ICONS 2011 : The Sixth International Conference on Systems. Copyright (c) IARIA, 2011. ISBN: 978-1-61208-114-4
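
The two constraints discussed above, link-disjoint paths within one step and at most k simultaneous injections and receptions per PE, can be illustrated with a small validity check. The following Python sketch is not the authors' scheduler. The function name `is_valid_step`, the representation of a path as a list of node indices, and the treatment of every link as a directed (from, to) pair are assumptions made only for illustration.

```python
from collections import Counter


def is_valid_step(paths, k):
    """Check one communication step against the k-port, contention-free rules.

    paths: list of paths, each a list of node indices from source to
           destination (hypothetical representation, not from the paper).
    k:     number of internal input ports and of internal output ports per PE.

    Returns True if no directed link is used by two paths and no PE
    injects or receives more than k messages in this step.
    """
    used_links = set()   # directed links already claimed in this step
    sends = Counter()    # messages injected per source PE
    recvs = Counter()    # messages received per destination PE

    for path in paths:
        src, dst = path[0], path[-1]
        sends[src] += 1
        recvs[dst] += 1
        # k-port limit: at most k output and k input internal channels.
        if sends[src] > k or recvs[dst] > k:
            return False
        # Contention-free rule: paths in one step must not share a link.
        for link in zip(path, path[1:]):
            if link in used_links:
                return False
            used_links.add(link)
    return True


if __name__ == "__main__":
    # Node 0 sends to two different neighbors in the same step.
    step = [[0, 1], [0, 7]]
    print(is_valid_step(step, k=1))  # one-port: second injection is blocked
    print(is_valid_step(step, k=2))  # two ports: both messages can be injected
```

With k = 1 the step above is rejected because node 0 would need two injection channels, whereas k = 2 accepts it since the two paths use different links. A step in which two paths traverse the same directed link is rejected for any k.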