Delay-based Cloud Congestion Control
Mitchell Gusat∗, Robert Birke†, Cyriel Minkenberg∗
†Dip. di Elettronica, Politecnico di Torino, Italy, Email: birke@tlc.polito.it
∗IBM Zurich Research Laboratory GmbH, Switzerland, Email: {mig,sil}@zurich.ibm.com
Abstract—As the Internet owes its scalability and stability to
TCP, congestion control also plays a key role in the performance,
efficiency, and stability of datacenters, as evidenced by the
efforts to standardize congestion management (CM) for 10+ Gbps
networks. The next step up from datacenters is CM for clouds.
New solutions are necessary for clouds because, while the recent
CM schemes have been tested on L2 reliable networks, they are
limited by design to relatively small single domain datacenters.
TCP, on the other hand, while scalable and continuously evolving,
was not designed for μs-latency lossless networks. To address this
problem we first investigate whether path delay could serve as a
reliable congestion measure for clouds spanning across multiple
datacenters and heterogeneous networks. A qualitative open loop
analysis of two datacenter networks (10 Gb/s Ethernet and
12x Infiniband) yields positive results. We close the congestion
control loop by adapting the delay observer to a re-designed
AIMD controller whose base algorithm we extensively analyzed
in IEEE 802.1Qau. However, despite the statistical correlation
between congestion severity and delay, the low signal/noise ratio
(1-2 dB) and the congestion notification lag threaten the closed
loop stability. Hence we design a combination of two original
filters, called Dual-Edge KDS-CUSUM. After preliminary loop
tuning in Matlab, simulation results in OMNeT++ bear out the
trade-off between stability and dynamic response. The concept
is validated on a cluster software implementation.
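To illustrate how a cumulative-sum statistic can separate a persistent delay shift from measurement noise, consider the following sketch of a generic one-sided CUSUM detector. This is not the paper's Dual-Edge KDS-CUSUM; the slack and threshold parameters below are illustrative assumptions.

```python
# Generic one-sided CUSUM detector for a persistent upward shift in
# path-delay samples. Illustrative only: the Dual-Edge KDS-CUSUM of
# the paper differs; mu0, k and h below are assumed values.

def cusum_detect(samples, mu0, k, h):
    """Return the index of the first alarm, or None.
    mu0: nominal (uncongested) mean delay
    k:   slack, roughly half the shift to be detected
    h:   decision threshold"""
    s = 0.0
    for i, x in enumerate(samples):
        s = max(0.0, s + (x - mu0 - k))  # accumulate positive drift only
        if s > h:
            return i
    return None

# Noisy delay trace (us): nominal ~10, congestion shifts the mean to ~14.
trace = [10, 11, 9, 10, 12, 14, 15, 13, 14, 15, 16, 14]
print(cusum_detect(trace, mu0=10.0, k=1.0, h=6.0))  # alarms at index 6
```

The slack k makes the statistic insensitive to zero-mean noise, while a sustained shift larger than k accumulates linearly until it crosses h; raising h trades detection lag for fewer false alarms, which mirrors the stability/response trade-off discussed above.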
I. CLOUD AND DATACENTER TRENDS
Cost and power constraints are leading toward IT federation,
i.e., the aggregation of server blades, clusters, and storage
networks into increasingly larger datacenters (DC). As DCs
are normally worst-case dimensioned to sustain the likely
peak load, they are largely under-utilized; anecdotal evidence
suggests a mean system utilization under 20%. To increase ef-
ficiency, the DC infrastructure—servers, storage and routers—
is virtualized to create multiple images of the actual systems,
each image running its own workload. An ensemble of one or
more such virtualized DCs is loosely called a cloud.
Consolidation also governs the network space: the currently
distinct SAN, StAN, and LAN solutions used for clouds
are wasteful and complex when combined under a single
roof. While for now 1-10 Gigabit Ethernet (GE), 10-40 Gbps
Infiniband (IBA), Fibre Channel, and Myrinet still coexist in
the same cloud, eventually their traffic will be aggregated on a
single network. Recent standardization efforts are converging
toward a unified datacenter network (DCN), possibly based
on Data Center Bridging (DCB) Ethernet [1]. The resulting increase
in utilization should not, however, decrease the application
performance. On the contrary, cloud service-level agreements
(SLAs) impose stricter transaction latency, jitter bounds, and
sustained throughput, hence the need for efficient scheduling
and congestion control.
Figure 1: Hotspot saturation tree in a 128-node fat tree DCN.
A. Congestion in Link-Level Flow-Controlled DCNs
To prevent packet drops in DCNs, link-level flow control
(LL-FC) is commonly used. In Ethernet this is accomplished
by 'pausing' a link. When the buffer occupancy reaches a
'stop' threshold, the switch issues a PAUSE frame on the
respective port, instructing the sender on the other side of the
link to stop sending for a given timeout period. When the
occupancy sinks below a 'start' threshold, the switch issues a
new PAUSE frame with the duration set to zero, so that the
sender can resume transmission without waiting for the timeout.
By configuring the stop/start thresholds to account for the
link round-trip time (RTT), packet drops due to buffer overflow
can be avoided. A similar mechanism in IBA uses credits
to meet the low-latency and losslessness requirements of
DC and cloud applications at a lower implementation cost.
However, these mechanisms can only curb transient congestion
on μs timescales.
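The stop/start PAUSE behaviour above can be sketched as a small single-queue model. This is a simplification under assumed parameters (link rate, RTT, thresholds), not the IEEE 802.3x specification; the headroom rule shown is the standard intuition that the stop threshold must leave room for one RTT's worth of in-flight data plus one maximum-size frame.

```python
# Sketch of stop/start PAUSE logic at a switch ingress buffer.
# Simplified single-queue, single-priority model; all constants
# below are illustrative assumptions.

LINK_RATE_BPS = 10e9   # 10 Gb/s link
RTT_S = 2e-6           # 2 us link round-trip time (assumed)
MTU = 1500             # bytes

# Headroom above the stop threshold must absorb the data still in
# flight during one RTT, plus one maximum-size frame already started.
HEADROOM = round(LINK_RATE_BPS / 8 * RTT_S) + MTU

class IngressBuffer:
    def __init__(self, capacity, start_thr):
        self.capacity = capacity
        self.stop_thr = capacity - HEADROOM   # issue PAUSE here
        self.start_thr = start_thr            # issue zero-time PAUSE here
        self.occupancy = 0
        self.paused = False

    def enqueue(self, nbytes):
        self.occupancy += nbytes
        if not self.paused and self.occupancy >= self.stop_thr:
            self.paused = True    # send PAUSE(quanta > 0) upstream

    def dequeue(self, nbytes):
        self.occupancy = max(0, self.occupancy - nbytes)
        if self.paused and self.occupancy < self.start_thr:
            self.paused = False   # send PAUSE(quanta = 0): resume

buf = IngressBuffer(capacity=64 * 1024, start_thr=16 * 1024)
print(buf.stop_thr)  # capacity minus RTT*rate headroom and one MTU
```

Provided the headroom term is sized this way, frames arriving during the PAUSE propagation delay still fit below capacity, which is how losslessness is preserved; the same accounting underlies credit sizing in IBA.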
B. Saturation Trees
The cost of losslessness through LL-FC is in high-order
head-of-line (HOL) blocking [2], saturation tree congestion
[3], and possibly deadlocks.
Figure 1 illustrates the problem. If a sufficient fraction of all
the inputs’ traffic targets one of the outputs (in the figure, the
output labeled 128), that output link can saturate: it becomes
a hotspot (HS) that causes the queues in the switch feeding
that link to fill up. If the traffic pattern persists then, no
matter what techniques are used to reassign buffer space, it
will ultimately be exhausted. This forces that switch’s LL-FC
to throttle back its inputs, which in turn causes the previous
stage to fill its buffer space. In a domino effect, the congestion
eventually backs up to the DCN inputs. This has been called
tree saturation [4] or congestion spreading. Saturation spreads
quickly via LL-FC; according to the analysis of [3], the tree
is filled in less than 10 traversal times of the network, far too
This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE "GLOBECOM" 2009 proceedings.
978-1-4244-4148-8/09/$25.00 ©2009