An Atomic Theory of Flow Behavior

Stefan Karpinski, Elizabeth M. Belding, Kevin C. Almeroth
Department of Computer Science
University of California, Santa Barbara
{sgk,ebelding,almeroth}@cs.ucsb.edu

Abstract—We propose an entirely new approach to understanding and analyzing the behavior of flows in packet networks. The essential concept of this new approach is to find atomic units of time and size behavior in traces of network flows. While the nonparametric statistical techniques for extracting these basic behavioral units are complex, the end result is quite simple: the units provide an alphabet for flow behavior. From a finite set of behavioral units, an infinite variety of actual behaviors can be composed. The space of behaviors naturally becomes a vector space generated by these atomic units of behavior. We use numerical linear algebra techniques to demonstrate useful and immediate applications of this theory to real-time traffic analysis, anomaly and attack detection, as well as workload generation for wireless experiments.

I. INTRODUCTION

The trouble with trying to understand or model behavioral patterns in packet networks is that beyond the packets themselves there is no inherent behavioral structure. There are flows of packets with the same source and destination IP addresses and TCP/UDP port numbers, and sessions of flows belonging to the same host, but these imply only very limited behavioral similarity. Each one has its own unique sequence of packet sizes and inter-packet intervals with no obvious relation to each other. Without fundamentally behavioral elements of structure, traffic traces are just “packet soup,” so to speak. Accordingly, traditional approaches to traffic analysis have either focused on aggregate traffic measures or categorized flows by well-known port numbers and application types. Some fiendishly clever techniques have been proposed to tease out the application types underlying traffic found in network traces [1].
Without an inherently behavioral theory of network traffic, however, we believe that insight into the fine structure of network behavior is ultimately very limited.

We propose to turn this problem on its head by providing an atomic behavioral theory for network traffic. We begin with the concept of a packet flow as a natural starting point and define a “flowlet” as a segment of a flow with statistically consistent packet size or inter-packet interval behavior. Common flowlet behaviors are extracted from a body of traffic traces using statistical clustering. These clusters of similarly behaved flowlets provide the atomic units of flow behavior: by mixing basic behaviors from a finite collection of flowlet clusters in varying proportions, an infinite variety of flow behaviors emerges. It becomes natural to view the space of flow behaviors as a vector space of linear combinations of flowlet clusters.

To demonstrate the viability and utility of this atomic theory of flow behavior, we apply standard numerical linear algebra techniques to the resulting vector space of flow behaviors. Principal component analysis (PCA) allows us to reduce the dimensionality of the vector space of flow behaviors while preserving the vast majority (99%) of the variability in the original data. PCA allows us to represent flow behaviors concisely in only eight dimensions (R^8). Moreover, this representation has several desirable properties. The eight coordinates of flows mapped into this space are linearly uncorrelated (non-linear dependencies, however, remain). The first dimension of the transformed data captures the most important differences between flows; the second dimension captures the next most important, and so on. The first two or three dimensions, thus, are ideal for visualization of behavioral differences. Finally, the dimensions are naturally scaled so that the standard Euclidean distance provides a good measure of the behavioral (dis)similarity between flows.
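The PCA reduction described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each flow has already been summarized as a vector of mixing proportions over a hypothetical set of 20 flowlet clusters, it uses synthetic Dirichlet-distributed data in place of real traces, and the names `fit_pca` and `project` are invented for illustration. The number of retained dimensions therefore depends on this synthetic data and need not equal the eight reported for the real traces.

```python
import numpy as np

def fit_pca(X, variance_kept=0.99):
    """Fit PCA to the rows of X, keeping the fewest principal
    components whose cumulative variance reaches variance_kept."""
    mean = X.mean(axis=0)
    centered = X - mean
    # SVD of the centered data: rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s ** 2 / (X.shape[0] - 1)
    cumulative = np.cumsum(variances) / variances.sum()
    k = min(int(np.searchsorted(cumulative, variance_kept)) + 1, len(s))
    return mean, vt[:k].T, variances[:k]

def project(X, mean, components):
    """Map flow vectors into the reduced behavior space."""
    return (X - mean) @ components

# Assumption: synthetic stand-in for real flows. Each row is one flow,
# expressed as proportions over 20 hypothetical flowlet clusters.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(20), size=500)

mean, components, variances = fit_pca(X, variance_kept=0.99)
Z = project(X, mean, components)

# Euclidean distance in the reduced space as a (dis)similarity measure
# between the first two flows.
distance = float(np.linalg.norm(Z[0] - Z[1]))
print("retained dimensions:", components.shape[1])
```

Because the projection is just a subtraction and a matrix product with precomputed `mean` and `components`, the same `project` call can be applied to flows that were not part of the fitting data.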
As a result, standard multi-dimensional analysis and modeling techniques, such as k-means clustering, can be applied directly to the transformed flow behaviors.

We present two useful, immediate applications of this new theory of flow behavior. First, once a body of traffic has been analyzed and PCA-transformation matrices computed, new flows can be mapped into the flow-behavior space using only fast matrix and vector computations with pre-computed matrices. This allows the possibility of completely real-time traffic analysis and visualization. Second, since the coordinates of the PCA-transformed flow behaviors are uncorrelated and exhibit limited non-linear dependencies, we can roughly model them using a multivariate normal distribution. Random deviates from this distribution can be mapped back into actual flow behaviors, allowing the generation of an entire network’s worth of heterogeneous synthetic flow behaviors with realistic intra-flow and inter-flow behavior matching the original trace traffic. This ability is of particular importance for experimental wireless research, where it has been shown that using standard, naïve traffic models, such as uniform constant bit-rate (CBR) flows, severely distorts important performance metrics at all levels of the network [2], [3].

II. MOTIVATION & RELATED WORK

In Internet traffic analysis, the detailed structure of workload patterns in local-area networks (LANs) is of limited interest. Capacity has become so cheap and plentiful in wired LANs that workload details are simply irrelevant in the face of massive over-provisioning. Wireless networks, however, are fundamentally different: the entire medium is at the “edge” of the network and the most basic resources of bandwidth and