Buffered Clos-network packet switch with per-output flow queues

Z. Dong, R. Rojas-Cessa and E. Oki

Proposed is a buffered Clos-network switch with per-output flow queues in the middle-stage modules. These queues avoid the head-of-line blocking that occurs in the middle stage of Clos-network switches built from simple crosspoint-buffered switch modules. It is shown that the proposed switch achieves higher performance than a conventional buffered Clos-network packet switch.

Introduction: Current advances in chip fabrication allow high on-chip memory density. However, the pin count has remained largely the same, which limits the number of ports a switch chip can support. A three-stage Clos-network switch [1] uses small switch modules, called input modules (IMs), central modules (CMs), and output modules (OMs), to implement a large switch (i.e. one with a large number of ports) with an efficient amount of hardware (e.g. number of chips). For example, a 256 × 256 switch can be implemented with 48 16 × 16 switch modules. However, the configuration time of a Clos-network switch may be long, as exchanging configuration information may require inter-chip signalling with long delays. To reduce this configuration delay, buffers in the switch modules may be used. The placement of buffers at different stages defines the switch model, namely the memory-space-memory (MSM) Clos-network switch [2] (buffers in IMs and OMs), the space-memory-memory (SMM) Clos-network switch [3] (buffers in CMs and OMs), and the memory-memory-memory (MMM) Clos-network switch [4–6] (buffers in IMs, CMs, and OMs). The MSM and SMM switches require large configuration times, proportional to the switch size [2, 7]. The MMM switch performs separate selections at each stage. This reduces the configuration time and increases the scalability of the Clos-network switch.

In this Letter, we follow the mainstream approach of switching fixed-length packets, called cells, in the switch.
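The 256 × 256 example in the Introduction can be checked with a short calculation. The sketch below is illustrative only: it assumes that every stage uses the same square module, so that each IM and each OM serves as many external ports as the module size and there is one CM per IM output link. The function name `clos_module_count` and its parameters are hypothetical and do not come from the Letter.

```python
def clos_module_count(num_ports: int, module_size: int) -> dict:
    """Count the modules of a three-stage Clos network built only from
    square module_size x module_size modules (an assumption of this sketch)."""
    if num_ports % module_size != 0:
        raise ValueError("num_ports must be a multiple of module_size")
    # Each IM (and each OM) serves module_size external ports.
    num_ims = num_ports // module_size
    # Each CM needs one port per IM, so it must fit in one module.
    if num_ims > module_size:
        raise ValueError("a CM would need more ports than module_size")
    # One CM per IM output link when IMs are square.
    num_cms = module_size
    return {
        "IM": num_ims,
        "CM": num_cms,
        "OM": num_ims,
        "total": 2 * num_ims + num_cms,
    }


counts = clos_module_count(256, 16)
print(counts)  # {'IM': 16, 'CM': 16, 'OM': 16, 'total': 48}
```

With 16 IMs, 16 CMs, and 16 OMs, the total of 48 modules matches the figure quoted for the 256 × 256 switch.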
The incoming variable-length packets are segmented into cells and re-assembled before they leave the switch. Therefore, it takes a fixed amount of time, called a time slot, to forward a cell from the input of a switch module to the output of the switch module. For example, a time slot for 512-bit cells is 51.2 ns under a link rate of 10 Gbit/s.

A conventional MMM Clos-network switch, or MMM switch for brevity, keeps queues in the CMs, one per OM, each storing cells destined to different output ports of the same destination OM, because buffered crossbars can then be used as off-the-shelf switch modules [4, 6]. A head-of-line (HoL) cell in such a queue may block the cells behind it that are destined to other output ports, even when the destination OM has room for them [8]. HoL blocking may therefore occur at the CMs, and it degrades the switch throughput [9]. Avoiding HoL blocking at the CMs of an MMM switch is therefore needed.

In this Letter, we propose an MMM Clos-network switch with per-output flow queues in the CMs, called the MMeM switch. Here, a flow is the set of packets going from input i to output j. In the MMeM switch, separate queues, one dedicated queue per flow for an output port, are allocated at each crosspoint buffer in the CMs to avoid HoL blocking. We show that the MMeM switch outperforms an MMM switch in terms of throughput and delay.

Fig. 1 N × N MMM switch

Fig. 2 Central module of N × N MMeM switch

MMM Clos-network switch with per-output flow queues (MMeM): Fig. 1 shows an N × N MMM switch. The proposed MMeM switch has IMs and OMs similar to those in an MMM switch. To avoid HoL blocking, the MMeM switch uses per-output flow queues at the CMs, as shown in Fig. 2. The MMeM switch consists of p IMs, m CMs, and p OMs. Each IM has n input ports and each OM has n output ports. Here, N = p × n. There are N virtual output queues (VOQs) at each input port, n × m virtual CM queues (VCMQs) at each IM, p × p × n per-output flow queues (POFQs) at each CM, and m × n virtual output port queues (VOPQs) at each OM. The terminology used in this Letter is presented in Table 1.

Table 1: Notations

Terminology        Definition
IM(i)              (i + 1)th input module
CM(r)              (r + 1)th central module
OM(j)              (j + 1)th output module
IP(i, g)           (g + 1)th input port at IM(i)
OP(j, h)           (h + 1)th output port at OM(j)
L_I(i, r)          Output link of IM(i) that is connected to CM(r)
L_C(r, j)          Output link of CM(r) that is connected to OM(j)
VOQ(i, g, j, h)    Virtual output queue at IP(i, g) that stores cells destined to OP(j, h)
VCMQ(i, g, r)      Virtual central module queue at IM(i) that stores cells from IP(i, g) that go through CM(r) to their output ports
VOMQ(i, r, j)      Virtual output module queue at CM(r) that stores cells from IM(i) destined to OM(j) (MMM switch)
POFQ(i, r, j, h)   Per-output flow queue at CM(r) that stores cells from L_I(i, r) destined to OP(j, h) (MMeM switch)
VOPQ(r, j, h)      Virtual output port queue at OM(j) that stores cells from CM(r) destined to OP(j, h)
k_1                Size of crosspoint queues in IMs, in cells
k_2                Size of crosspoint/per-output flow queues in CMs of MMM and MMeM switches, respectively, in cells
k_3                Size of crosspoint queues in OMs, in cells

In this Letter, round-robin (RR) and longest queue first (LQF) selection schemes are considered as input arbitration schemes to observe the maximum switch performance under uniform and
nonuniform traffic, respectively. Other schemes can also be adopted. The selection of CMs at the IMs and the arbitration at the CMs and OMs are RR-based. Credit-based flow control is used at each module to avoid queue overflow.

Performance evaluation: The performance of 256 × 256 MMeM and MMM switches was evaluated under uniform traffic with Bernoulli and bursty arrivals, and under nonuniform traffic with Bernoulli arrivals. For the MMeM switch, k_1 = k_2 = k_3 = 1 cell, which is the minimum size of the crosspoint buffer. For the MMM switch, k_1 = k_3 = 1 cell and k_2 = {1, 16} cells. The latter value of k_2 is used to compare the performance of both switches with the same amount of memory. The switches were modelled in C for event-driven simulation. Simulation results were obtained with a 95% confidence interval, with standard error not greater than 5% for the average queuing delay.

Uniform traffic: The selection scheme at the input port, IM, CM and OM arbiters for the MMM and MMeM switches is RR. Traffic is considered with Bernoulli and bursty arrivals, where l is the average burst length and a cell burst is defined by an on–off Markov modulated

ELECTRONICS LETTERS 6th January 2011 Vol. 47 No. 1