Token Flow Control

Amit Kumar, Li-Shiuan Peh and Niraj K. Jha
Department of Electrical Engineering
Princeton University, Princeton, NJ 08544
Email: {amitk, peh, jha}@princeton.edu

Abstract

As companies move towards many-core chips, an efficient on-chip communication fabric to connect these cores assumes critical importance. To address limitations in wire-delay scalability and increasing bandwidth demands, state-of-the-art on-chip networks use a modular packet-switched design with routers at every hop, which allows network channels to be shared among multiple packet flows. This, however, forces packets through a complex router pipeline at every hop, so that the overall communication energy/delay is dominated by the router overhead rather than by wire energy/delay alone. In this work, we propose token flow control (TFC), a flow control mechanism in which nodes in the network send out tokens in their local neighborhood to communicate information about their available resources. These tokens are then used in both routing and flow control: to choose less congested paths in the network and to bypass the router pipeline along those paths. These bypass paths are formed dynamically, can be arbitrarily long, and are highly flexible, with the ability to match a packet’s exact route. This allows packets to potentially skip all routers along their path from source to destination, approaching the communication energy-delay-throughput of dedicated wires. Our detailed implementation analysis shows TFC to be highly scalable and realizable at an aggressive target clock cycle delay of 21 FO4 for large networks while requiring low hardware complexity. Evaluations of TFC using both synthetic traffic and traces from the SPLASH-2 benchmark suite show a reduction in packet latency of up to 77.1%, with up to a 39.6% reduction in average router energy consumption, as compared to a state-of-the-art baseline packet-switched design.
For the same saturation throughput as the baseline network, TFC is able to reduce the amount of buffering by 65%, leading to a 48.8% reduction in leakage energy and a 55.4% lower total router energy.

1. Introduction

The current trend in utilizing the growing number of transistors provided by each technology generation is to use a modular design with several computation cores on the same chip. As the number of such on-chip cores increases, a scalable and high-bandwidth communication fabric to connect them becomes critically important. As a result, packet-switched on-chip networks are fast replacing buses and crossbars to emerge as the pervasive communication fabric in both general-purpose chip multi-processor (CMP) [1]–[3] and application-specific system-on-a-chip (SoC) [4] domains.

Apart from providing scalable and high-bandwidth communication, on-chip networks are required to provide ultra-low latency within an extremely constrained power envelope and a low area budget. Most state-of-the-art packet-switched designs use a complex router at every node to orchestrate communication, and packets travel only a short distance on the link wires before having to go through a complete router pipeline at every intermediate hop along their path. As a result, communication energy/delay in such networks is dominated by the router overhead, in contrast to an ideal network where packet latency and energy are solely due to the wires between the source and destination. For instance, routers consume around 61% of the average network power in the MIT Raw chip, as opposed to 39% consumed by the links [5]. Similarly, in the Intel 80-core teraflops chip, routers take 83% of network power versus 17% consumed by the links [3]. The large energy-delay-throughput gap between the state-of-the-art packet-switched network and the ideal interconnect of dedicated point-to-point wires was pointed out in [6].
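To make the router-versus-wire gap concrete, the following minimal sketch computes the router-to-link power ratio from the two figures quoted above and compares a path that pays a router pipeline at every hop against a wire-only path. The hop count, link delay, and router pipeline depth are hypothetical values chosen purely for illustration; they are assumptions, not measurements from this paper.

```python
# Router vs. link shares of network power quoted in the text (percent).
POWER_SHARES = {
    "MIT Raw": (61, 39),
    "Intel 80-core teraflops": (83, 17),
}


def router_to_link_ratio(router_pct, link_pct):
    """Units of router power spent per unit of link power."""
    return router_pct / link_pct


def packet_latency(hops, link_cycles, router_cycles):
    """Zero-load latency in cycles: each hop pays link delay plus router delay."""
    return hops * (link_cycles + router_cycles)


# Hypothetical parameters (assumptions for illustration only).
HOPS = 6           # assumed source-to-destination distance
LINK_CYCLES = 1    # assumed link traversal delay per hop
ROUTER_CYCLES = 3  # assumed router pipeline depth per hop

ratios = {chip: router_to_link_ratio(r, l) for chip, (r, l) in POWER_SHARES.items()}
baseline = packet_latency(HOPS, LINK_CYCLES, ROUTER_CYCLES)
ideal = packet_latency(HOPS, LINK_CYCLES, 0)  # wire-only path, no router delay

for chip, ratio in ratios.items():
    print(f"{chip}: router/link power ratio = {ratio:.2f}")
print(f"baseline latency = {baseline} cycles, wire-only latency = {ideal} cycles")
```

Under these assumed parameters the router pipeline accounts for three quarters of the zero-load latency, which is the kind of overhead a router-bypassing mechanism targets.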
In this work, we propose TFC, a flow-control mechanism which aims to deliver the energy-delay-throughput of dedicated wires through the use of tokens. Tokens are indications of resource availability in the network. Each node in the network sends out tokens in its fixed local neighborhood of d_max hops to disseminate information about the availability of resources, such as buffers and virtual channels (VCs), at its input ports. Individual packets then use these tokens during both routing, to find less congested routes in chunks of up to d_max hops, and flow control, to bypass the router pipeline at intermediate nodes along these d_max-hop routes. When one such d_max-hop token route ends, another token route can be chained to it seamlessly without any additional energy-delay overhead. Thus, packets can use an arbitrary number of tokens to bypass all intermediate routers between their source and destination, as in an ideal network.

In the rest of this paper, Section 2 provides background for this work by looking at router energy/delay overhead in state-of-the-art packet-switched designs. This is followed by the working of TFC in Section 3 and its implementation details in Section 4. Evaluation results are presented in Section 5. Section 6 presents related work, while Section 7 concludes the paper.

2. Background

2.1. Baseline state-of-the-art router

Fig. 1(a) shows the microarchitecture of a state-of-the-art baseline VC router used for comparison in all our experiments. We assume a two-dimensional mesh topology for simplicity. Flit-level buffering and on/off VC flow control [7] are used to minimize the amount of buffering per router and hence its area footprint. This design incorporates several features which are critical to on-chip networks – low pipeline delay using lookahead routing [8], speculation [11], [12], no-load bypassing
978-1-4244-2837-3/08/$25.00 ©2008 IEEE