Congestion Control Using Policy Rollout

Gang Wu
School of Electrical and Computer Engineering
Purdue University
West Lafayette, Indiana 47907
Email: gwu@ecn.purdue.edu

Edwin K. P. Chong
Department of Electrical and Computer Engineering
Colorado State University
Fort Collins, Colorado 80523
Email: echong@engr.colostate.edu

Robert Givan
School of Electrical and Computer Engineering
Purdue University
West Lafayette, Indiana 47907
Email: givan@ecn.purdue.edu

Abstract: We consider the congestion-control problem in a communication network with multiple traffic sources, each modelled as a fully-controllable stream of fluid traffic and associated with a unique round-trip delay. The bandwidth available to the controlled sources is stochastic due to high-priority cross traffic, described by a Markov-modulated fluid. The goal is to maximize a linear combination of the throughput, delay, and traffic loss at the bottleneck node, while achieving fairness among controlled sources. The control problem is posed as a Markov decision process (MDP). We heuristically solve the MDP via a technique called policy rollout. Our empirical study demonstrates that the control scheme performs significantly better than conventional congestion controllers. We further find that employing different estimates of the “Q-value” in solving the MDP leads to comparable overall cumulative rewards, although the component contributions can be quite different.

I. INTRODUCTION

We study the congestion-control problem in a network where a bottleneck node is shared by “best-effort-traffic” sources and other high-priority “cross-traffic” sources. The best-effort sources can be fully controlled; each such source originates at a unique distance from the bottleneck node and thus has a unique control delay. Taking these delays into account in decision-making is a difficult problem.
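To make the role of the per-source delays concrete, the following is a minimal sketch under stated assumptions, not the model used in this paper. It simulates a discrete-time fluid buffer in which the rate commanded for source i at epoch t shows up at the bottleneck only at epoch t + delays[i]. The function name `simulate_buffer`, its parameters, and all numbers are hypothetical and chosen only for illustration.

```python
def simulate_buffer(commands, delays, capacity, buffer_size):
    """Discrete-time fluid buffer fed by sources with different control delays.

    Assumed (illustrative) model:
      commands[i][t] : rate commanded for source i at epoch t
      delays[i]      : epochs before that traffic reaches the bottleneck
      capacity[t]    : bandwidth left over by cross traffic at epoch t
      buffer_size    : fluid that can be held; any excess is lost
    Returns a list of (queue, throughput, loss) tuples, one per epoch.
    """
    queue = 0.0
    trace = []
    for t in range(len(capacity)):
        # Source i contributes the rate it was commanded delays[i] epochs ago.
        arrivals = sum(
            cmd[t - d] for cmd, d in zip(commands, delays) if t - d >= 0
        )
        backlog = queue + arrivals
        served = min(backlog, capacity[t])          # throughput this epoch
        remaining = backlog - served
        loss = max(0.0, remaining - buffer_size)    # overflow is dropped
        queue = remaining - loss
        trace.append((queue, served, loss))
    return trace


# Two sources, both commanded at rate 1.0, with delays of 1 and 3 epochs.
commands = [[1.0] * 6, [1.0] * 6]
trace = simulate_buffer(commands, delays=[1, 3], capacity=[1.5] * 6, buffer_size=1.0)
print(trace[-1])  # (1.0, 1.5, 0.5)
```

In this toy run the buffer stays empty until the slower source's traffic arrives at epoch 3; the queue then builds and overflows two epochs later. A controller that reacts only to the current queue would issue its correction too late for the distant source, which is the difficulty the paragraph above refers to.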
The objective of congestion control is to determine proper and fair transmission rates for the best-effort sources so that they use the bandwidth available to them at the bottleneck node efficiently while achieving low queuing delay and a low traffic-loss rate.

In ATM (Asynchronous Transfer Mode) networks, best-effort traffic can be ABR (Available-Bit-Rate) or UBR (Unspecified-Bit-Rate) traffic. High-priority cross traffic represents CBR (Constant-Bit-Rate) or VBR (Variable-Bit-Rate) traffic. In IP (Internet-Protocol) networks, best-effort traffic can be traffic receiving low-priority service via the CBQ (Class-based Queuing) scheme [1], and high-priority traffic can be traffic receiving high-priority service in the CBQ scheme.

We assume that we are provided with a stochastic model of the high-priority cross traffic; the cross traffic at the bottleneck node is a Markov-modulated fluid (MMF) [2]. We formulate the congestion-control problem as a discrete-time Markov decision process (MDP) [3] and use a measure of performance over long traces of cross-traffic variation, balancing throughput, delay, and loss.

This research was supported in part by NSF under grants ECS-0098089, ANI-0099137, ANI-0207892, and IIS-0093100.

Previous work on congestion control using an MDP formulation includes [4], [5], [6]. Our work differs from [4], [5], [6] in the sizes of the action spaces, the generality of the reward structures, and the solution methods. In particular, this paper differs from our earlier work in [6] in two ways. First, to achieve fairness in [6], we placed a penalty in the reward structure on the difference in the amounts of arriving traffic from the controlled sources. Here we take a different approach: we aim to attain fairness by “hard-wiring” the system such that a common rate command is calculated for all sources at each control epoch. Consequently, we have a simplified reward structure and reduced action and state spaces.
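The Markov-modulated fluid cross traffic and the throughput, delay, and loss trade-off described above can be sketched as follows. This is an illustrative sketch only: the two-state chain, the rates, the reward weights, the use of queue length as a stand-in for delay, and the names `sample_mmf_rates` and `step_reward` are all assumptions, not the model or reward used in this paper.

```python
import random


def sample_mmf_rates(transition, rates, steps, state=0, seed=0):
    """Sample a discrete-time Markov-modulated fluid rate trace.

    transition[s][s2] is the probability of moving from state s to s2;
    rates[s] is the fluid rate emitted while the chain is in state s.
    """
    rng = random.Random(seed)
    trace = []
    for _ in range(steps):
        trace.append(rates[state])
        # Draw the next modulating state from the current row of the chain.
        state = rng.choices(range(len(rates)), weights=transition[state])[0]
    return trace


def step_reward(throughput, queue, loss,
                w_throughput=1.0, w_delay=0.1, w_loss=2.0):
    """Linear reward: throughput is rewarded; queue length (a proxy for
    queuing delay) and loss are penalized. The weights are hypothetical."""
    return w_throughput * throughput - w_delay * queue - w_loss * loss


# Hypothetical two-state cross-traffic source on a link of rate 1.0.
transition = [[0.9, 0.1], [0.3, 0.7]]
cross_rates = [0.2, 0.8]
link_rate = 1.0
cross = sample_mmf_rates(transition, cross_rates, steps=10)
available = [link_rate - r for r in cross]  # bandwidth left for best-effort traffic
```

With a common rate command, the action at each epoch is a single number rather than one rate per source, which is one way to see why the action space shrinks under the hard-wired fairness approach.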
Second, instead of the Hindsight-Optimization (HO) technique used in [6] to obtain an estimate of the gradient of the “Q-value” (defined later) with respect to the action, here we explore a method called Policy Rollout (PR), introduced in [7], [8], [9] for solving complex MDP problems. Our empirical study shows that the cumulative rewards attained in this paper are comparable to those in [6], and fairness is improved.

Our controller is significantly more sophisticated than the congestion-control mechanism in TCP (Transmission Control Protocol), which is widely used in the current Internet. Built in a distributed-control fashion, TCP is suited to deployment at end users, where it provides scalable congestion control at the expense of performance. Our controller, in contrast, adopts a centralized-control approach and is designed for high performance. It is suitable for deployment, for example, at edge routers, where congestion is more likely to occur than elsewhere and where scalability does not pose serious problems; in such settings it is a good choice for high-performance congestion control. Our experiments show that the performance of TCP is vastly worse than that of centralized controllers, e.g., our controller and the well-known family of proportional-derivative (PD) controllers. Therefore, in our empirical study, we focus on the comparison between our controller and the PD controllers.

With minor modification to the equation governing the buffer dynamics, our technique also applies to the situation where cross traffic is queued together with the controlled traffic in a common buffer; i.e., traffic is not prioritized. This model is familiar in the current Internet, where cross traffic represents