FLOW VECTOR PREDICTION

Tarem Ahmed and Mark Coates, McGill University
{tahmed, coates}@tsp.ece.mcgill.ca

ABSTRACT

This paper considers the problem of predicting the number, length and distribution of traffic flows some time into the future, based upon packets collected in the present. Three methods (the standard Expectation-Maximization algorithm, a distributed version of the Expectation-Maximization algorithm, and a Particle Filter) are used to predict the mean flow length and complete flow distributions for subsequent timesteps. We propose a model to represent the histogram of flows corresponding to any given time interval, and use the aforementioned methods to estimate the parameters of the model. The proposed algorithms are tested on a large number of commonly available data traces. The results indicate that the three methods perform comparably well in terms of the distance between the predicted flow distributions and actual flow histograms. An important application of our work is in resource reservation for protocols that require guaranteed qualities of service.

1. INTRODUCTION

Traffic passing through a node corresponds to many different types of applications and protocols. It is important for routers to be able to predict certain characteristics of network traffic ahead of time. Quantities of interest include the number of distinct flows within a time period, the lengths of these flows, and the distribution of flow lengths. Flow scheduling is an important factor to consider in efficiently utilizing available bandwidth. Delays between two edge nodes are typically up to 100 ms in wide-area networks such as the Agile All-Photonic Network (AAPN). Assuming a frame length of 10 ms, this often means that bandwidth reservations need to be made at least 10 frames in advance. We begin by suggesting a suitable model for the flow distribution, and then present three methods for estimating the parameters of the model.
The estimated parameters can then be used to predict the number and distribution of flow lengths in future time intervals. We test our methods by comparing the predictions based on our estimates with the actual flow histograms.

This project was funded by AAPN.

Examples of use of information on flow distribution at core routers include the following:

Resource Reservation: Certain classes and types of traffic require guaranteed qualities of service. Knowledge of the amount of such traffic to expect within a time frame is required to reserve resources in intermediate routers accordingly.

Sampling Rates: Keeping a record for every packet is infeasible in high-speed routers. Such routers randomly sample packets, and estimate statistics about the original packet stream from the sampled data. The efficiency of the sampling scheme depends on the flow distribution of the original stream [1].

Resource Utilization: Knowledge of the distribution of traffic flows is needed to evaluate gains in the deployment of web proxies [2] and to study efficiency of cache utilization.

Characterizing Source Traffic: Information about the distribution of flows can provide insight into the higher level protocols that the traffic corresponds to (e.g., real-time or peer-to-peer), and help determine thresholds for creating new connections in flow-switched networks [3].

Traffic Engineering: One could use information on flow distribution to balance the total volume of traffic at a core node, based on a small number of identified flows [4]. Moreover, the complexity of optimizing algorithms for multipath routing is reduced if the number of flows is limited [5–7].

1.1. Definitions

An IP flow is usually defined to be a set of packets that share a common key and occur within some period.
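
To make this definition concrete, the following minimal Python sketch groups the packets observed in one time interval by key and derives the flow lengths, their histogram, and the mean flow length. It assumes the key is the 4-tuple of source and destination IP addresses and port numbers. The `Packet` type and all function names are hypothetical, introduced only for illustration, and the sketch does not model when a flow record becomes inactive.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Packet(NamedTuple):
    """Hypothetical packet record; the field names are illustrative only."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def flow_lengths(packets: Iterable[Packet]) -> Counter:
    """Map each flow key (4-tuple) to the number of packets seen for it."""
    return Counter(
        (p.src_ip, p.src_port, p.dst_ip, p.dst_port) for p in packets
    )


def flow_histogram(lengths: Counter) -> Counter:
    """Map each flow length to the number of flows that have that length."""
    return Counter(lengths.values())


def mean_flow_length(lengths: Counter) -> float:
    """Average number of packets per flow; 0.0 when no flows were observed."""
    return sum(lengths.values()) / len(lengths) if lengths else 0.0
```

For instance, three packets sharing one key plus a single packet with a different key yield two flows, of lengths 3 and 1. The histogram returned by `flow_histogram` is the kind of per-interval quantity that the methods in this paper aim to predict.
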
We define the key to be the following 4-tuple:

key = (source IP address, source port number, destination IP address, destination port number)

Thus a flow refers to a connection between specific applications in specific end systems. The flow length is defined to be the number of packets that belong to a particular flow (as identified by its key).

In order to compile flow statistics, routers maintain records indexed by flow keys. A flow is said to be active if a record exists for its key. Once a new packet arrives at a router, the router first determines if a record is active for the flow, based on the new packet's key. If not, a new record is created with the packet's key. For a record to be active for the arriving