Space-Code Bloom Filter for Efficient Per-Flow Traffic Measurement

Abhishek Kumar∗, Jun (Jim) Xu∗, Jia Wang†, Oliver Spatschek†, Li (Erran) Li‡

∗ College of Computing, Georgia Institute of Technology, Atlanta, Georgia 30332–0280, {akumar,jx}@cc.gatech.edu
† AT&T Labs - Research, Florham Park, NJ 07932-0971, {jiawang,spatsch}@research.att.com
‡ Bell Labs, Lucent Technologies, Holmdel, NJ 07733-3030, erranlli@bell-labs.com

Abstract: Per-flow traffic measurement is critical for usage accounting, traffic engineering, and anomaly detection. Previous methodologies are either based on random sampling (e.g., Cisco’s NetFlow), which is inaccurate, or only account for the “elephants”. We introduce a novel technique for measuring per-flow traffic approximately, for all flows regardless of their sizes, at very high speeds (e.g., OC-768). The core of this technique is a novel data structure called the Space-Code Bloom Filter (SCBF). An SCBF is an approximate representation of a multiset; each element in this multiset is a traffic flow, and its multiplicity is the number of packets in the flow. The multiplicity of an element in the multiset represented by an SCBF can be estimated through either of two mechanisms: Maximum Likelihood Estimation (MLE) or Mean Value Estimation (MVE). Through parameter tuning, SCBF allows a graceful tradeoff between measurement accuracy and computational and storage complexity. SCBF also contributes to the foundation of data streaming by introducing a new paradigm called blind streaming. We evaluate the performance of SCBF through mathematical analysis and through experiments on packet traces gathered from a tier-1 ISP backbone. Our results demonstrate that SCBF achieves reasonable measurement accuracy with very low storage and computational complexity.

Index Terms: Network Measurement, Traffic Analysis, Data Structures, Statistical Inference, Bloom Filter.

I. INTRODUCTION

Accurate traffic measurement and monitoring is critical for network management.
For example, per-flow traffic accounting has applications in usage-based charging/pricing, network anomaly detection, security, and traffic engineering [1]. While there has been considerable research on characterizing the statistical distribution of per-flow traffic [2] or on identifying and measuring a few large flows (elephants) [1], [3], [4], little work has been done on investigating highly efficient algorithms and data structures to facilitate per-flow measurement on very high-speed links. To fill this gap, we propose a novel data structure called the Space-Code Bloom Filter (SCBF) and explore its applications to network measurement in general, and to per-flow traffic accounting in particular.

A (traditional) Bloom filter [5] is an approximate representation of a set S which, given an arbitrary element x, allows for the membership query “x ∈ S?”. A Space-Code Bloom Filter (SCBF), on the other hand, is an approximate representation of a multiset M, which allows for the query “how many occurrences of x are there in M?”. Just as a Bloom filter achieves a nice tradeoff between space efficiency (bits per element) and the ratio of false positives, SCBF achieves a nice tradeoff between the accuracy of counting and the number of bits used for counting.

This work was supported in part by the National Science Foundation under Grant ITR/SY ANI-0113933 and under NSF CAREER Award Grant ANI-0238315.

SCBF has several important applications in network measurement. This paper focuses on its application to performing “per-flow” traffic accounting without per-flow state on a high-speed link. Given a flow identifier, SCBF returns the estimated number of packets in the flow during a measurement epoch. Here, a flow identifier can be an IP address, a source and destination IP address pair, the combination of IP addresses and port numbers, or other attributes that can identify a flow.

Per-flow accounting is a challenging task on high-speed network links.
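As a rough illustration of the two query types contrasted above, the following sketch pairs a minimal traditional Bloom filter (membership only) with an exact per-flow counter, i.e., the kind of per-flow state whose cost motivates this work. It is not the SCBF itself. The class name `BloomFilter`, the parameters `m` and `k`, the salted SHA-256 hashing, and the sample flow labels are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Minimal traditional Bloom filter (illustrative only).

    Answers "x in S?" with possible false positives but no false negatives.
    """

    def __init__(self, m=1024, k=4):
        self.m = m                # number of bits (assumed size)
        self.k = k                # number of hash functions (assumed)
        self.bits = bytearray(m)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical packet stream: one flow label per packet.
packets = ["10.0.0.1->10.0.0.2"] * 3 + ["10.0.0.3->10.0.0.4"]

bf = BloomFilter()
exact = Counter()  # exact per-flow state: one counter per distinct flow
for flow in packets:
    bf.add(flow)
    exact[flow] += 1

print("10.0.0.1->10.0.0.2" in bf)   # membership query: "x in S?"
print(exact["10.0.0.1->10.0.0.2"])  # multiplicity query, answered exactly
```

The Bloom filter uses a fixed number of bits regardless of how many flows it sees but can only answer membership, whereas the exact counter answers the multiplicity query at the cost of one entry per flow. An SCBF targets the second query with storage closer to the first.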
While keeping per-flow state would make accounting straightforward, it is not desirable, since such a large state will only fit in DRAM and DRAM speed cannot keep up with the rate of a high-speed link. While random sampling, such as that used in Cisco NetFlow, reduces the requirement on memory speed, it introduces excessive measurement errors for flows other than elephants, as shown in Section II. Our approach is to perform traffic accounting on a very small amount of high-speed SRAM, organized as an SCBF page. Once an SCBF page becomes full (we formalize this notion later), it is eventually paged to persistent storage such as disk. Later, to find out the traffic volume of a flow identified by a label x during a measurement epoch, the SCBF pages corresponding to the epoch can be queried using x to provide an approximate answer.

0-7803-8356-7/04/$20.00 (C) 2004 IEEE. IEEE INFOCOM 2004

The challenges facing this approach are threefold. First, the amount of persistent storage needed to store SCBF pages cannot be unreasonably large, even for a high-speed link like OC-768 (40 Gbps). Second, the computational complexity of processing each packet needs to be low enough to keep up with the link speed. Third, the accounting needs to be fairly accurate for all flows, despite the aforementioned storage and complexity constraints. SCBF is designed to meet all these challenges. Our design