IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 8, AUGUST 2013 1481
Low Propagation Delay Load-Balanced
4 × 4 Switch Fabric IC in 0.13-μm
CMOS Technology
Ching-Te Chiu, Yu-Hao Hsu, Wei-Chih Lai, Jen-Ming Wu, Shawn S. H. Hsu,
Yang-Syu Lin, Fan-Ta Chen, Min-Sheng Kao, and Yar-Sun Hsu
Abstract— A load-balanced Birkhoff-von Neumann (LB-BvN)
4 × 4 switch fabric IC is proposed for feedback-based switch
systems. The chip is fabricated in 0.13-μm CMOS technology with
a chip area of 1.380 × 1.080 mm². The overall data rate of the
LB-BvN 4 × 4 switch fabric IC is up to 32 Gb/s (8 Gb/s/channel)
with only a 0.8-ns propagation delay. The LB-BvN switch is a
promising candidate for constructing next-generation terabit switches.
In a feedback-based switch system, the long propagation delay of
the switch module reduces the system throughput significantly. In
this paper, we present a scalable LB-BvN 4 × 4 switch fabric IC
that operates directly in the high-speed domain. By observing the deterministic
switching pattern of the N × N LB-BvN switch, we present a
low-complexity pattern generator (PG) that reduces the PG
complexity from O(N³) to O(1). This technique reduces the propagation
delay of the switch module from 30 to 0.8 ns, and also provides
80% area saving and 85% power saving compared to serializer–
deserializer interfaces. The proposed LB-BvN 4 × 4 switch fabric
IC is suitable for feedback-based switch systems to solve the
throughput degradation problem.
Index Terms— Current-mode logic (CML), load-balanced
Birkhoff-von Neumann switch, low propagation delay, scalability,
serializer–deserializer (SerDes), switch fabric IC.
I. INTRODUCTION
More and more computers and commercial devices communicate
with each other by either wired or wireless connections,
and this revolution has led to increasing data
traffic in networks. With the availability of high-speed internet,
Manuscript received October 10, 2011; revised May 15, 2012; accepted
July 21, 2012. Date of publication September 4, 2012; date of current
version July 22, 2013. This work was supported in part by the National
Science Council, Taiwan, under Contract NSC 97-2221-E-007-112-MY3 and
the Advanced Research for Next-Generation Networking and Communications
Project 98N2502E.
C.-T. Chiu and W.-C. Lai are with the Department of Computer Science
and the Institute of Communications Engineering, National Tsing
Hua University, Hsinchu 300, Taiwan (e-mail: ctchiu@cs.nthu.edu.tw;
andy751026@hotmail.com).
Y.-H. Hsu is with the Embedded SRAM Library Department, Taiwan
Semiconductor Manufacturing Company, Hsinchu 300-77, Taiwan (e-mail:
shhsu@ee.nthu.edu.tw).
J.-M. Wu, S. S. H. Hsu, F.-T. Chen, M.-S. Kao, and Y.-S. Hsu are with
the Department of Electrical Engineering, National Tsing Hua University,
Hsinchu 300, Taiwan (e-mail: jmwu@ee.nthu.edu.tw; shhsu@ee.nthu.edu.tw;
fanta524cf@yahoo.com.tw; kaom0711@gmail.com; yshsu@ee.nthu.edu.tw).
Y.-S. Lin is with the High Speed Memory Development Program, Taiwan
Semiconductor Manufacturing Company, Hsinchu 300-77, Taiwan (e-mail:
yslinze@tsmc.com).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2012.2212618
cloud computing services are being provided to corporate
and individual users. These applications, which require high
bandwidths, have become more and more popular, and this
trend is set to continue. Therefore, to support high-bandwidth
traffic, the performance of internet routers and switches should
grow drastically.
Most switches currently available in the market are based
on the shared memory switch architecture, which is one of
the output-buffered switches [1]. In this architecture, packets
are stored and forwarded in a common shared memory. As
the speed of fiber optics advances, the memory access speed
becomes a bottleneck (a scalability problem). If the line rate is
R, an N × N shared memory must sustain an aggregate data rate
of up to 2 × N × R (N writes and N reads in each time slot). To
achieve higher speed, one has to use parallel-buffered switch architectures to
obtain the needed speedup. One common approach, known as
the input-queued switch architecture, is to have parallel buffers
in front of a switch fabric [2].
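The 2 × N × R requirement above is easy to check numerically; the sketch below uses the 4-port, 8-Gb/s/channel figures of the proposed IC purely as example inputs, and is an illustration rather than a model from the paper:

```python
def shared_memory_bandwidth(n_ports: int, line_rate_gbps: float) -> float:
    """Worst-case memory data rate (Gb/s) an N x N shared-memory switch
    must sustain: N writes plus N reads at line rate R in every time slot."""
    return 2 * n_ports * line_rate_gbps

# A 4 x 4 switch at 8 Gb/s per line needs 2 * 4 * 8 = 64 Gb/s of memory
# bandwidth; at higher port counts and line rates this quickly outpaces
# practical memory access speeds, which is the scalability problem above.
print(shared_memory_bandwidth(4, 8.0))  # → 64.0
```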
An input-queued switch, with each input maintaining a
single first-in first-out (FIFO) queue, may suffer head-of-line
(HOL) blocking, which can degrade throughput to as low as
58% [3]. One way to solve this problem is
the virtual output queuing (VOQ) technique, which maintains
a separate queue for each output at each input. As there are N²
buffers (memories) at the inputs of an N × N switch fabric, the
key problem of input-queued switches (equipped with VOQs)
is to apply a certain matching algorithm to choose at most
N of the N² HOL packets to transmit through the switch fabric
[3]–[9]. The maximum weight matching algorithm can find a
solution to this problem but, unfortunately, has a complexity
of O(N^2.5 log N) [10], which makes it difficult to implement
in practice.
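The VOQ organization described above can be sketched as a minimal data structure; this is an illustrative model of the N² queue layout, not the authors' implementation:

```python
from collections import deque

class VOQInput:
    """One input port of an N x N VOQ switch: a separate FIFO per output,
    so a blocked head-of-line cell for one output never stalls traffic
    destined to another output (illustrative sketch only)."""
    def __init__(self, n_outputs: int):
        self.queues = [deque() for _ in range(n_outputs)]

    def enqueue(self, packet, output: int) -> None:
        self.queues[output].append(packet)

    def hol_packets(self) -> dict:
        # The HOL cells this input offers to the matching algorithm;
        # across N inputs there are up to N * N = N^2 such candidates.
        return {o: q[0] for o, q in enumerate(self.queues) if q}
```

Each of the N inputs exposes up to N HOL candidates per time slot, which is where the N² figure cited above comes from.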
Several heuristics have been proposed to lower the complexity.
The iSLIP algorithm converges to a maximal matching in O(log N)
iterations using 2N arbiters [11].
The computational complexity of randomized algorithms is
O (log N ) at the cost of increasing cell delay [12], [13].
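To make the per-slot arbitration overhead concrete, a single request-grant-accept iteration of an iSLIP-style scheduler might look as follows. This is a simplified sketch, not code from the paper or from [11]: the full algorithm iterates until no further matches are added and updates pointers only for matches made in the first iteration.

```python
def islip_iteration(requests, grant_ptr, accept_ptr):
    """One request-grant-accept iteration of an iSLIP-style scheduler.
    requests[i] is the set of outputs input i has cells for;
    grant_ptr[o] and accept_ptr[i] are the round-robin pointers of
    the 2N arbiters (one per output, one per input)."""
    n = len(requests)
    # Grant: each output grants the requesting input nearest its pointer.
    grants = {}  # input -> list of outputs that granted it
    for o in range(n):
        for k in range(n):
            i = (grant_ptr[o] + k) % n
            if o in requests[i]:
                grants.setdefault(i, []).append(o)
                break
    # Accept: each input accepts the granting output nearest its pointer.
    match = {}
    for i, outs in grants.items():
        for k in range(n):
            o = (accept_ptr[i] + k) % n
            if o in outs:
                match[i] = o
                # Pointers advance one past the accepted match, which
                # desynchronizes the arbiters over successive slots.
                grant_ptr[o] = (i + 1) % n
                accept_ptr[i] = (o + 1) % n
                break
    return match
```

Even in this stripped-down form, every time slot requires a full exchange of requests and grants among all ports, which illustrates the communication overhead discussed next.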
The input-queued switch has a much longer delay than the
load-balanced switch when the traffic arrival rate is above
0.9 [14]. Matching algorithms for conflict resolution require
extra computation and communication overheads in every
time slot, and these overheads result in another scalability
issue. Furthermore, matching algorithms cannot guarantee
100% throughput theoretically without a speedup of 2 because