IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 8, AUGUST 2013

Low Propagation Delay Load-Balanced 4 × 4 Switch Fabric IC in 0.13-μm CMOS Technology

Ching-Te Chiu, Yu-Hao Hsu, Wei-Chih Lai, Jen-Ming Wu, Shawn S. H. Hsu, Yang-Syu Lin, Fan-Ta Chen, Min-Sheng Kao, and Yar-Sun Hsu

Abstract—A load-balanced Birkhoff–von Neumann (LB-BvN) 4 × 4 switch fabric IC is proposed for feedback-based switch systems. The chip is fabricated in 0.13-μm CMOS technology and occupies an area of 1.380 × 1.080 mm². The overall data rate of the LB-BvN 4 × 4 switch fabric IC reaches 32 Gb/s (8 Gb/s per channel) with a propagation delay of only 0.8 ns, making the LB-BvN switch a strong candidate for constructing next-generation terabit switches. In a feedback-based switch system, the long propagation delay of the switch module significantly reduces system throughput. In this paper, we present a scalable LB-BvN 4 × 4 switch fabric IC that operates directly in the high-speed domain. By exploiting the deterministic switching pattern of the N × N LB-BvN switch, we present a low-complexity pattern generator (PG) that reduces the PG complexity from O(N³) to O(1). This technique reduces the propagation delay of the switch module from 30 to 0.8 ns, and also provides 80% area savings and 85% power savings compared with serializer–deserializer interfaces. The proposed LB-BvN 4 × 4 switch fabric IC is therefore well suited to feedback-based switch systems, where it resolves the throughput degradation problem.

Index Terms—Current-mode logic (CML), load-balanced Birkhoff–von Neumann switch, low propagation delay, scalability, serializer–deserializer (SerDes), switch fabric IC.

I. INTRODUCTION

More and more computers and commercial devices communicate with each other over wired or wireless connections, and this revolution has led to ever-increasing data traffic in networks. With the availability of high-speed internet,

Manuscript received October 10, 2011; revised May 15, 2012; accepted July 21, 2012.
Date of publication September 4, 2012; date of current version July 22, 2013. This work was supported in part by the National Science Council, Taiwan, under Contract NSC 97-2221-E-007-112-MY3, and in part by the Advanced Research for Next-Generation Networking and Communications Project 98N2502E.

C.-T. Chiu and W.-C. Lai are with the Department of Computer Science and the Institute of Communications Engineering, National Tsing Hua University, Hsinchu 300, Taiwan (e-mail: ctchiu@cs.nthu.edu.tw; andy751026@hotmail.com).

Y.-H. Hsu is with the Embedded SRAM Library Department, Taiwan Semiconductor Manufacturing Company, Hsinchu 300-77, Taiwan (e-mail: shhsu@ee.nthu.edu.tw).

J.-M. Wu, S. S. H. Hsu, F.-T. Chen, M.-S. Kao, and Y.-S. Hsu are with the Department of Electrical Engineering, National Tsing Hua University, Hsinchu 300, Taiwan (e-mail: jmwu@ee.nthu.edu.tw; shhsu@ee.nthu.edu.tw; fanta524cf@yahoo.com.tw; kaom0711@gmail.com; yshsu@ee.nthu.edu.tw).

Y.-S. Lin is with the High Speed Memory Development Program, Taiwan Semiconductor Manufacturing Company, Hsinchu 300-77, Taiwan (e-mail: yslinze@tsmc.com).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2012.2212618

cloud computing services are being provided to corporate and individual users. These applications, which require high bandwidth, have become increasingly popular, and this trend is set to continue. To support such high-bandwidth traffic, the performance of internet routers and switches must therefore grow drastically. Most switches currently on the market are based on the shared-memory switch architecture, a form of output-buffered switch [1]. In this architecture, packets are stored in and forwarded from a common shared memory. As the speed of fiber optics advances, the memory access speed becomes a bottleneck (a scalability problem).
If the line rate is R, an N × N common shared memory must sustain an aggregate data rate of up to 2 × N × R at any given time (N inputs writing while N outputs read). To achieve higher speed, one has to use parallel-buffered switch architectures to obtain the needed speedup. One common approach, known as the input-queued switch architecture, places parallel buffers in front of a switch fabric [2]. An input-queued switch in which each input maintains a single first-in first-out (FIFO) queue may suffer from the head-of-line (HOL) blocking problem, which can degrade throughput to 58% [3]. One way to solve this problem is the virtual output queuing (VOQ) technique, which maintains a separate queue for each output at each input. Since this places N² buffers (memories) at the inputs of an N × N switch fabric, the key problem of input-queued switches equipped with VOQs is to apply a matching algorithm that chooses at most N of the N² HOL packets to transmit through the switch fabric [3]–[9]. The maximum weight matching algorithm can solve this problem but, unfortunately, has a complexity of O(N^2.5 log N) [10], which makes it difficult to implement in practice. Several heuristics have been proposed to lower the complexity. The iSLIP algorithm converges to a maximal matching with a time complexity of O(log N), using 2N arbiters [11]. Randomized algorithms achieve a computational complexity of O(log N) at the cost of increased cell delay [12], [13]. The input-queued switch also exhibits a much longer delay than the load-balanced switch when the traffic arrival rate is above 0.9 [14]. Matching algorithms for conflict resolution require extra computation and communication overhead in every time slot, and these overheads create another scalability issue. Furthermore, matching algorithms cannot theoretically guarantee 100% throughput without a speedup of 2 because

1063-8210/$31.00 © 2012 IEEE
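The VOQ organization and the per-slot matching problem described above can be sketched in a few lines of code. The sketch below is illustrative only: the class and method names are hypothetical, and the single greedy pass stands in for the iterative arbiters of iSLIP-style algorithms [11]; it merely shows how N² queues remove HOL blocking while still requiring a matching that selects at most N of the N² HOL packets in each time slot.

```python
from collections import deque

class VOQSwitch:
    """Input-queued switch with virtual output queues (VOQs).

    Each of the n inputs keeps n FIFO queues, one per output
    (n^2 queues in total), so a packet destined for a busy
    output never blocks packets behind it that are destined
    for idle outputs.
    """

    def __init__(self, n: int):
        self.n = n
        self.voq = [[deque() for _ in range(n)] for _ in range(n)]

    def enqueue(self, inp: int, out: int, pkt) -> None:
        """Store a packet arriving at input `inp` for output `out`."""
        self.voq[inp][out].append(pkt)

    def match(self) -> dict:
        """One greedy matching pass: choose at most n of the n^2
        HOL packets so that no input or output is used twice."""
        used_out = set()
        matches = {}
        for i in range(self.n):
            for j in range(self.n):
                if self.voq[i][j] and j not in used_out:
                    matches[i] = j
                    used_out.add(j)
                    break
        return matches

    def transmit(self) -> list:
        """Send the HOL packet of every matched (input, output) pair."""
        sent = []
        for i, j in self.match().items():
            sent.append((i, j, self.voq[i][j].popleft()))
        return sent
```

For example, if inputs 0 and 1 both hold packets for output 0, only one can win the slot; a packet for output 1 queued behind the loser would still be sent, because it sits in its own VOQ rather than behind the blocked packet.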
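The load-balanced switch mentioned above avoids this matching overhead entirely, because its connection pattern is deterministic and periodic. A minimal sketch, assuming the standard round-robin pattern in which input i connects to output (i + t) mod N in time slot t (the function name is hypothetical); this periodicity is what allows a pattern generator to run in O(1) with a simple counter, as claimed in the abstract:

```python
def lb_bvn_pattern(t: int, n: int) -> list[int]:
    """Connection pattern of an n x n load-balanced Birkhoff-von
    Neumann stage at time slot t: pattern[i] is the output port
    connected to input port i. Because the pattern depends only
    on (t mod n), generating it requires nothing more than a
    modulo-n counter, not precomputed per-slot state."""
    return [(i + t) % n for i in range(n)]

# Each slot's pattern is a permutation, and over n consecutive
# slots every input is connected to every output exactly once,
# so the stage spreads traffic uniformly across outputs.
slots = [lb_bvn_pattern(t, 4) for t in range(4)]
```

No arbitration or request–grant communication is needed per slot; the only per-slot work is advancing the counter, which is the source of the O(N³)-to-O(1) reduction in pattern-generator complexity.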