FLOW VECTOR PREDICTION

Tarem Ahmed and Mark Coates, McGill University
{tahmed, coates}@tsp.ece.mcgill.ca

ABSTRACT

This paper considers the problem of predicting the number, length and distribution of traffic flows some time into the future, based upon packets collected in the present. Three methods (the standard Expectation-Maximization algorithm, a distributed version of the Expectation-Maximization algorithm, and a Particle Filter) are used to predict the mean flow length and complete flow distributions for subsequent timesteps. We propose a model to represent the histogram of flows corresponding to any given time interval, and use the aforementioned methods to estimate the parameters of the model. The proposed algorithms are tested on a large number of commonly-available data traces. The results indicate that the three methods perform comparably well in terms of the distance between the predicted flow distributions and actual flow histograms. An important application of our work is in resource reservation for protocols that require guaranteed qualities of service.

1. INTRODUCTION

Traffic passing through a node corresponds to many different types of applications and protocols. It is important for routers to be able to predict certain characteristics of network traffic ahead of time. Quantities of interest include the number of distinct flows within a time period, the lengths of these flows, and the distribution of flow lengths. Flow scheduling is an important factor to consider in efficiently utilizing available bandwidth. Delays between two edge nodes are typically up to 100 ms in wide-area networks such as the Agile All-Photonic Network (AAPN). Assuming a frame length of 10 ms, this often means that bandwidth must be reserved at least 10 frames in advance. We begin by suggesting a suitable model for the flow distribution, and then present three methods for estimating the parameters of the model.
The estimated parameters can then be used to predict the number and distribution of flow lengths in future time intervals. We test our methods by comparing the predictions based on our estimates with the actual flow histograms.

This project was funded by AAPN.

Examples of the use of information on flow distribution at core routers include the following:

Resource Reservation: Certain classes and types of traffic require guaranteed qualities of service. Knowledge of the amount of such traffic to expect within a time frame is required to reserve resources in intermediate routers accordingly.

Sampling Rates: Keeping a record for every packet is infeasible in high-speed routers. Such routers randomly sample packets, and estimate statistics about the original packet stream from the sampled data. The efficiency of the sampling scheme depends on the flow distribution of the original stream [1].

Resource Utilization: Knowledge of the distribution of traffic flows is needed to evaluate gains in the deployment of web proxies [2] and to study the efficiency of cache utilization.

Characterizing Source Traffic: Information about the distribution of flows can provide insight into the higher-level protocols that the traffic corresponds to (e.g. real-time, peer-to-peer, etc.), and help determine thresholds for creating new connections in flow-switched networks [3].

Traffic Engineering: One could use information on flow distribution to balance the total volume of traffic at a core node, based on a small number of identified flows [4]. Moreover, the complexity of optimizing algorithms for multipath routing is reduced if the number of flows is limited [5–7].

1.1. Definitions

An IP flow is usually defined to be a set of packets that share a common key and occur within some period.
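
As an illustration of this definition, the following minimal Python sketch groups packets by key within a time window and counts the packets seen for each key. The representation of a packet as a (key, timestamp) pair, the function names and the sample data are assumptions made purely for illustration; they are not part of the measurement procedure used in this paper.

```python
from collections import Counter


def packets_per_key(packets, start, end):
    """Count packets per key among those with start <= timestamp < end.

    `packets` is an iterable of (key, timestamp) pairs; the key is treated
    as an opaque hashable value (hypothetical representation).
    """
    counts = Counter()
    for key, timestamp in packets:
        if start <= timestamp < end:
            counts[key] += 1
    return counts


def count_histogram(counts):
    """Map each packet count to the number of keys having that count."""
    return Counter(counts.values())


# Hypothetical sample: (key, arrival time in seconds).
packets = [("A", 0.1), ("B", 0.2), ("A", 0.4), ("C", 0.5), ("A", 0.9), ("B", 1.3)]

counts = packets_per_key(packets, start=0.0, end=1.0)
histogram = count_histogram(counts)
```

Here the last packet of key "B" falls outside the window and is not counted, so the same key can contribute different counts to different intervals.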
We define the key to be the following 4-tuple:

key = (source IP address, source port number, destination IP address, destination port number)

Thus a flow refers to a connection between specific applications in specific end systems. The flow length is defined to be the number of packets that belong to a particular flow (as identified by its key). In order to compile flow statistics, routers maintain records indexed by flow keys. A flow is said to be active if a record exists for its key. When a new packet arrives at a router, the router first determines whether a record is active for the flow, based on the new packet’s key. If not, a new record is created with the packet’s key. For a record to be active for the arriving