An Atomic Theory of Flow Behavior

Stefan Karpinski, Elizabeth M. Belding, Kevin C. Almeroth
Department of Computer Science
University of California, Santa Barbara
{sgk,ebelding,almeroth}@cs.ucsb.edu

Abstract: We propose an entirely new approach to understanding and analyzing the behavior of flows in packet networks. The essential concept of this new approach is to find atomic units of time and size behavior in traces of network flows. While the nonparametric statistical techniques for extracting these basic behavioral units are complex, the end result is quite simple: the units provide an alphabet for flow behavior. From a finite set of behavioral units, an infinite variety of actual behaviors can be composed. The space of behaviors naturally becomes a vector space generated by these atomic units of behavior. We use numerical linear algebra techniques to demonstrate useful and immediate applications of this theory to real-time traffic analysis, anomaly and attack detection, as well as workload generation for wireless experiments.

I. INTRODUCTION

The trouble with trying to understand or model behavioral patterns in packet networks is that beyond the packets themselves there is no inherent behavioral structure. There are flows of packets with the same source and destination IP addresses and TCP/UDP port numbers, and sessions of flows belonging to the same host, but these imply only very limited behavioral similarity. Each one has its own unique sequence of packet sizes and inter-packet intervals with no obvious relation to each other. Without fundamentally behavioral elements of structure, traffic traces are just “packet soup,” so to speak. Accordingly, traditional approaches to traffic analysis have either focused on aggregate traffic measures or categorized flows by well-known port numbers and application types. Some fiendishly clever techniques have been proposed to tease out the application types underlying traffic found in network traces [1].
Without an inherently behavioral theory of network traffic, however, we believe that insight into the fine structure of network behavior is ultimately very limited.

We propose to turn this problem on its head by providing an atomic behavioral theory for network traffic. We begin with the concept of a packet flow as a natural starting point and define a “flowlet” as a segment of a flow with statistically consistent packet size or inter-packet interval behavior. Common flowlet behaviors are extracted from a body of traffic traces, using statistical clustering. These clusters of similarly behaved flowlets provide the atomic units of flow behavior: by mixing basic behaviors from a finite collection of flowlet clusters in varying proportions, an infinite variety of flow behaviors emerges. It becomes natural to view the space of flow behaviors as a vector space of linear combinations of flowlet clusters.

To demonstrate the viability and utility of this atomic theory of flow behavior, we apply standard numerical linear algebra techniques to the resulting vector space of flow behaviors. Principal component analysis (PCA) allows us to reduce the dimensionality of the vector space of flow behaviors while preserving the vast majority (99%) of the variability in the original data. PCA allows us to represent flow behaviors concisely in only eight dimensions (R^8). Moreover, this representation has several desirable properties. The eight coordinates of flows mapped into this space are linearly uncorrelated (non-linear dependencies, however, remain). The first dimension of the transformed data captures the most important differences between flows; the second dimension the next most important, and so on. The first two or three dimensions, thus, are ideal for visualization of behavioral differences. Finally, the dimensions are naturally scaled so that the standard Euclidean distance provides a good measure of the behavioral (dis)similarity between flows.
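As a rough illustration of the PCA step described above, the following Python sketch reduces a matrix of flow-behavior vectors while retaining at least 99% of the variance. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `fit_pca` and `transform` are hypothetical, each flow is assumed to be a row of mixing proportions over flowlet clusters, and the data are synthetic Dirichlet draws, so the number of retained dimensions will not match the eight reported for real traces.

```python
import numpy as np

def fit_pca(X, keep=0.99):
    """Fit PCA on the rows of X (hypothetical helper).

    Returns the column means and the smallest set of principal directions
    whose cumulative explained variance is at least `keep`.
    """
    mean = X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions,
    # and the squared singular values are proportional to the variances.
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    k = min(int(np.searchsorted(cumulative, keep)) + 1, len(s))
    return mean, Vt[:k]

def transform(X, mean, components):
    """Map flow-behavior vectors into the reduced PCA space."""
    return (X - mean) @ components.T

# Assumption: 500 synthetic flows, each a mixture over 20 flowlet clusters
# (rows are non-negative and sum to 1).
rng = np.random.default_rng(0)
X = rng.dirichlet(np.full(20, 0.3), size=500)

mean, components = fit_pca(X, keep=0.99)
Z = transform(X, mean, components)

# Euclidean distance in the reduced space as a behavioral dissimilarity.
distance = np.linalg.norm(Z[0] - Z[1])
```

The columns of `Z` are linearly uncorrelated and ordered by decreasing variance, which is why the first two or three columns are the natural choice for plotting.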
As a result, standard multi-dimensional analysis and modeling techniques, such as k-means clustering, can be applied directly to the transformed flow behaviors.

We present two useful, immediate applications of this new theory of flow behavior. First, once a body of traffic has been analyzed and PCA-transformation matrices computed, new flows can be mapped into the flow-behavior space using only fast matrix and vector computations with pre-computed matrices. This allows the possibility of completely real-time traffic analysis and visualization. Second, since the coordinates of the PCA-transformed flow behaviors are uncorrelated and exhibit limited non-linear dependencies, we can roughly model them using a multivariate normal distribution. Random deviates from this distribution can be mapped back into actual flow behaviors, allowing the generation of an entire network’s worth of heterogeneous synthetic flow behaviors with realistic intra-flow and inter-flow behavior matching the original trace traffic. This ability is of particular importance for experimental wireless research, where it has been shown that using standard, naïve traffic models, such as uniform constant bit-rate (CBR) flows, severely distorts important performance metrics at all levels of the network [2], [3].

II. MOTIVATION & RELATED WORK

In Internet traffic analysis, the detailed structure of workload patterns in local-area networks (LANs) is of limited interest. Capacity has become so cheap and plentiful in wired LANs that workload details are simply irrelevant in the face of massive over-provisioning. Wireless networks, however, are fundamentally different: the entire medium is at the “edge” of the network and the most basic resources of bandwidth and