Delay-based Cloud Congestion Control

Mitchell Gusat, Robert Birke, Cyriel Minkenberg
Dip. di Elettronica, Politecnico di Torino, Italy, Email: birke@tlc.polito.it
IBM Zurich Research Laboratory GmbH, Switzerland, Email: {mig,sil}@zurich.ibm.com

Abstract—As the Internet owes its scalability and stability to TCP, congestion control also plays a key role in the performance, efficiency, and stability of datacenters, as evidenced by the efforts to standardize congestion management (CM) for 10+ Gbps networks. The next step up from datacenters is CM for clouds. New solutions are necessary for clouds because, while the recent CM schemes have been tested on L2 reliable networks, they are limited by design to relatively small single-domain datacenters. TCP, on the other hand, while scalable and continuously evolving, was not designed for μs-latency lossless networks. To address this problem we first investigate whether path delay could serve as a reliable congestion measure for clouds spanning multiple datacenters and heterogeneous networks. A qualitative open-loop analysis of two datacenter networks (10 Gb/s Ethernet and 12x Infiniband) yields positive results. We close the congestion control loop by adapting the delay observer to a re-designed AIMD controller whose base algorithm we extensively analyzed in IEEE 802.1Qau. However, despite the statistical correlation between congestion severity and delay, the low signal-to-noise ratio (1-2 dB) and the congestion notification lag threaten the closed-loop stability. Hence we design a combination of two original filters, called Dual-Edge KDS-CUSUM. After preliminary loop tuning in Matlab, simulation results in OMNeT++ bear out the trade-off between stability and dynamic response. The concept is validated on a cluster software implementation.

I. CLOUD AND DATACENTER TRENDS

Cost and power constraints are leading toward IT federation, i.e., the aggregation of server blades, clusters, and storage networks into increasingly larger datacenters (DC). As DCs are normally worst-case dimensioned to sustain the likely peak load, they are largely under-utilized; anecdotal evidence suggests a mean system utilization under 20%. To increase efficiency, the DC infrastructure (servers, storage, and routers) is virtualized to create multiple images of the actual systems, each image running its own workload. An ensemble of one or more such virtualized DCs is loosely called a cloud.

Consolidation also governs the network space: the currently distinct SAN, StAN, and LAN solutions used for clouds are wasteful and complex when combined under a single roof. While for now 1-10 Gigabit Ethernet (GE), 10-40 Gbps Infiniband (IBA), Fibre Channel, and Myrinet still coexist in the same cloud, eventually their traffic will be aggregated on a single network. Recent standardization efforts are converging toward a unified datacenter network (DCN), possibly based on DC Bridging Ethernet (DCB) [1]. The resulting increase in utilization should not, however, decrease the application performance. On the contrary, cloud service-level agreements (SLAs) impose stricter transaction latency, jitter bounds, and sustained throughput, hence the need for efficient scheduling and congestion control.

Figure 1: Hotspot saturation tree in a 128-node fat-tree DCN.

A. Congestion in Link-Level Flow-Controlled DCNs

To prevent packet drops in DCNs, link-level flow control (LL-FC) is commonly used. In Ethernet this is accomplished by 'pausing' a link. When the buffer occupancy reaches a 'stop' threshold, the switch issues a PAUSE frame on the respective port, instructing the sender on the other side of the link to stop sending for a given timeout period.
When the occupancy sinks below a 'start' threshold, the switch issues a new PAUSE frame with the duration set to zero, so that the sender can resume transmission without waiting for the timeout. By configuring the stop/start thresholds to account for the link round-trip time (RTT), packet drops due to buffer overflow can be avoided. A similar credit-based mechanism in IBA meets the low-latency and losslessness requirements of DC and cloud applications at a lower implementation cost. However, these mechanisms can only curb transient congestion on μs timescales.

B. Saturation Trees

The cost of losslessness through LL-FC lies in high-order head-of-line (HOL) blocking [2], saturation tree congestion [3], and possibly deadlocks. Figure 1 illustrates the problem. If a sufficient fraction of all the inputs' traffic targets one of the outputs (in the figure, the output labeled 128), that output link can saturate: it becomes a hotspot (HS) that causes the queues in the switch feeding that link to fill up. If the traffic pattern persists, then, no matter what techniques are used to reassign buffer space, it will ultimately be exhausted. This forces that switch's LL-FC to throttle back its inputs, which in turn causes the previous stage to fill its buffer space. In a domino effect, the congestion eventually backs up to the DCN inputs. This has been called tree saturation [4] or congestion spreading. Saturation spreads quickly via LL-FC; according to the analysis of [3], the tree is filled in less than 10 traversal times of the network, far too

This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE "GLOBECOM" 2009 proceedings. 978-1-4244-4148-8/09/$25.00 ©2009
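The stop/start threshold sizing described in Section I-A can be sketched numerically. The Python fragment below is an illustrative back-of-the-envelope model, not the paper's mechanism: the buffer size, link rate, RTT, and MTU are assumed example values, and the function name is hypothetical.

```python
# Illustrative sketch (assumed parameters, not from the paper): sizing
# the PAUSE 'stop' threshold so that data still in flight when the
# PAUSE frame takes effect fits in the remaining buffer space,
# avoiding drops in a lossless Ethernet DCN.

def stop_threshold(buffer_bytes, link_rate_bps, rtt_s, mtu_bytes=1500):
    """Return the 'stop' occupancy threshold in bytes.

    After the PAUSE frame is issued, up to one link RTT worth of data
    (plus one maximum-size frame already being transmitted) can still
    arrive, so the threshold must leave at least that much headroom.
    """
    headroom = int(link_rate_bps / 8 * rtt_s) + mtu_bytes
    assert headroom < buffer_bytes, "buffer too small for this link RTT"
    return buffer_bytes - headroom

# Example: 10 GbE link, 2 us link RTT, 128 KiB switch buffer.
# In flight during the RTT: 10 Gb/s * 2 us = 2500 B, plus one 1500 B
# frame, i.e. 4000 B of headroom reserved above the threshold.
thresh = stop_threshold(buffer_bytes=128 * 1024,
                        link_rate_bps=10e9,
                        rtt_s=2e-6)
# thresh == 127072 (131072 - 4000)
```

The same arithmetic explains why LL-FC alone cannot stop saturation trees: the thresholds only bound occupancy per hop, so a persistent hotspot simply pushes each upstream buffer to its own stop threshold, stage by stage.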