Congestion Control Using Policy Rollout

Gang Wu
School of Electrical and Computer Engineering
Purdue University
West Lafayette, Indiana 47907
Email: gwu@ecn.purdue.edu

Edwin K. P. Chong
Department of Electrical and Computer Engineering
Colorado State University
Fort Collins, Colorado 80523
Email: echong@engr.colostate.edu

Robert Givan
School of Electrical and Computer Engineering
Purdue University
West Lafayette, Indiana 47907
Email: givan@ecn.purdue.edu

Abstract: We consider the congestion-control problem in a communication network with multiple traffic sources, each modelled as a fully-controllable stream of fluid traffic and associated with a unique round-trip delay. The bandwidth available to the controlled sources is stochastic due to high-priority cross traffic, described by a Markov-modulated fluid. The goal is to maximize a linear combination of the throughput, delay, and traffic loss at the bottleneck node, while achieving fairness among controlled sources. The control problem is posed as a Markov decision process (MDP). We heuristically solve the MDP via a technique called policy rollout. Our empirical study demonstrates that the control scheme performs significantly better than conventional congestion controllers. We further find that employing different estimates of the "Q-value" in solving the MDP leads to comparable overall cumulative rewards, although the component contributions can be quite different.

I. INTRODUCTION

We study the congestion-control problem in a network where a bottleneck node is shared by "best-effort-traffic" sources and other high-priority "cross-traffic" sources. The best-effort sources can be fully controlled; each such source originates at a unique distance from the bottleneck node and thus has a unique control delay. Taking these delays into account in decision-making is a difficult problem.
The objective of congestion control is to determine proper and fair transmission rates for the best-effort sources, so that they efficiently use the bandwidth available to them at the bottleneck node while achieving low queuing delay and a low traffic loss rate.

In ATM (Asynchronous Transfer Mode) networks, best-effort traffic can be ABR (Available-Bit-Rate) or UBR (Unspecified-Bit-Rate) traffic. High-priority cross traffic represents CBR (Constant-Bit-Rate) or VBR (Variable-Bit-Rate) traffic. In IP (Internet-Protocol) networks, best-effort traffic can be traffic receiving low-priority service via the CBQ (Class-based Queuing) scheme [1], and high-priority traffic can be traffic receiving high-priority service in the CBQ scheme.

We assume that we are provided with a stochastic model of the high-priority cross traffic; the cross traffic at the bottleneck node is a Markov-modulated fluid (MMF) [2]. We formulate the congestion-control problem as a discrete-time Markov decision process (MDP) [3] and use a measure of performance over long traces of cross-traffic variation, balancing throughput, delay, and loss.

This research was supported in part by NSF under grants ECS-0098089, ANI-0099137, ANI-0207892, and IIS-0093100.

Previous work on congestion control using an MDP formulation includes [4], [5], [6]. Our work differs from [4], [5], [6] in the sizes of the action spaces, the generality of the reward structures, and the solution methods.

In particular, this paper differs from our earlier work in [6] in two ways. First, to achieve fairness in [6], we included a penalty in the reward structure on the difference in the amounts of traffic arriving from the controlled sources. Here we take a different approach: we aim to attain fairness by "hard-wiring" the system so that a common rate command is calculated for all sources at each control epoch. Consequently, we have a simplified reward structure and reduced action and state spaces.
Second, instead of the Hindsight-Optimization (HO) technique used in [6] to obtain an estimate of the gradient of the "Q-value" (defined later) with respect to the action, here we explore a method called Policy Rollout (PR), introduced in [7], [8], [9] for solving complex MDP problems. Our empirical study shows that the cumulative rewards attained in this paper are comparable to those in [6], and fairness is improved.

Our controller is significantly more sophisticated than the congestion-control mechanism in TCP (Transmission Control Protocol), which is widely used in the current Internet. Built in a distributed-control fashion, TCP is suitable for deployment at end users, where it provides scalable congestion control at the expense of performance. Our controller, in contrast, adopts a centralized-control approach and is designed for high performance. It is suitable for deployment, for example, at edge routers, where congestion is more likely to occur than elsewhere and where scalability does not pose serious problems; in such settings it is a good choice for high-performance congestion control. Our experiments show that the performance of TCP is vastly worse than that of centralized controllers, e.g., our controller and the well-known family of proportional-derivative (PD) controllers. Therefore, in our empirical study, we focus on the comparison between our controller and the PD controllers.

With minor modification to the equation governing the buffer dynamics, our technique also applies to the situation where cross traffic is queued together with the controlled traffic in a common buffer; i.e., traffic is not prioritized. This model is familiar in the current Internet, where cross traffic represents