Online Identification of Hierarchical Heavy Hitters: Algorithms, Evaluation, and Applications

Yin Zhang ⋆, Sumeet Singh §, Subhabrata Sen ⋆, Nick Duffield ⋆, Carsten Lund ⋆

AT&T Labs – Research, Florham Park, NJ 07932, USA ⋆
CSE Department, University of California, San Diego, CA 92040, USA §

{yzhang,sen,duffield,lund}@research.att.com
susingh@cs.ucsd.edu

ABSTRACT

In traffic monitoring, accounting, and network anomaly detection, it is often important to be able to detect high-volume traffic clusters in near real-time. Such heavy-hitter traffic clusters are often hierarchical (i.e., they may occur at different aggregation levels like ranges of IP addresses) and possibly multidimensional (i.e., they may involve the combination of different IP header fields like IP addresses, port numbers, and protocol). Without prior knowledge about the precise structures of such traffic clusters, a naive approach would require the monitoring system to examine all possible combinations of aggregates in order to detect the heavy hitters, which can be prohibitive in terms of computation resources. In this paper, we focus on online identification of 1-dimensional and 2-dimensional hierarchical heavy hitters (HHHs), arguably the two most important scenarios in traffic analysis. We show that the problem of HHH detection can be transformed to one of dynamic packet classification by taking a top-down approach and adaptively creating new rules to match HHHs. We then adapt several existing static packet classification algorithms to support dynamic packet classification. The resulting HHH detection algorithms have much lower worst-case update costs than existing algorithms and can provide tunable deterministic accuracy guarantees. As an application of these algorithms, we also propose robust techniques to detect changes among heavy-hitter traffic clusters. Our techniques can accommodate variability due to sampling that is increasingly used in network measurement.
Evaluation based on real Internet traces collected at a Tier-1 ISP suggests that these techniques are remarkably accurate and efficient.

Categories and Subject Descriptors
C.2.3 [Computer-Communications Networks]: Network Operations - Network Monitoring, Network Management

General Terms
Measurement, Algorithms

Keywords
Network Anomaly Detection, Data Stream Computation, Hierarchical Heavy Hitters, Change Detection, Packet Classification

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IMC’04, October 25–27, 2004, Taormina, Sicily, Italy.
Copyright 2004 ACM 1-58113-821-0/04/0010 ...$5.00.

1. INTRODUCTION

1.1 Motivation and background

The Internet has emerged as a critical communication infrastructure, carrying traffic for a wide range of important scientific, business and consumer applications. Network service providers and enterprise network operators need the ability to detect anomalous events in the network, for network management and monitoring, reliability, security and performance reasons. While some traffic anomalies are relatively benign and tolerable, others can be symptomatic of potentially serious problems such as performance bottlenecks due to flash crowds [24], network element failures, malicious activities such as denial of service (DoS) attacks [23], and worm propagation [28]. It is therefore very important to be able to detect traffic anomalies accurately and in near real-time, to enable timely initiation of appropriate mitigation steps. This paper focuses on streaming techniques for enabling accurate, near real-time detection of anomalies in IP network traffic data.
A major challenge for anomaly detection is that traffic anomalies often have very complicated structures: they are often hierarchical (i.e., they may occur at arbitrary aggregation levels like ranges of IP addresses and port numbers) and sometimes also multidimensional (i.e., they can only be exposed when we examine traffic with specific combinations of IP address ranges, port numbers, and protocol). In order to identify such multidimensional hierarchical traffic anomalies, a naive approach would require the monitoring system to examine all possible combinations of aggregates, which can be prohibitive even for just two dimensions. Another challenge is the need to process massive streams of traffic data online and in near real-time. Given today’s traffic volume and link speeds, the input data stream can easily contain millions of concurrent flows or more, so it is often infeasible or too expensive to maintain per-flow state.

1.2 Heavy hitters, aggregation and hierarchies

A very useful concept in identifying dominant or unusual traffic patterns is that of hierarchical heavy hitters (HHHs) [11]. A heavy hitter is an entity that accounts for at least a specified proportion of the total activity, measured in terms of number of packets, bytes, connections, etc. A heavy hitter could correspond to an individual flow or connection. It could also be an aggregation of multiple flows/connections that share some common property, but which themselves may not be heavy hitters.

Of particular interest to our application is the notion of hierarchical aggregation. IP addresses can be organized into a hierarchy according to prefix. The challenge for hierarchical aggregation is to efficiently compute the total activity of all traffic matching relevant prefixes. A hierarchical heavy hitter is a hierarchical aggregate that accounts for some specified proportion of the total activity.
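
To make the definitions above concrete, the following Python sketch computes, offline and exactly, the total byte count of every prefix in a small IPv4 hierarchy and reports those prefixes whose share of the total reaches a threshold. It is only an illustration of the definition as stated here, not the online algorithm developed in this paper: it keeps state for every prefix seen, which is precisely the cost the paper aims to avoid. The function names, the 8-bit prefix granularity (`step`), the threshold name `phi`, and the sample flows are all assumptions made for the example.

```python
from collections import defaultdict
from ipaddress import ip_network


def prefix_totals(flows, step=8):
    """Sum byte counts over every ancestor prefix of each address.

    flows: iterable of (ipv4_address_string, byte_count) pairs.
    step:  assumed prefix-length granularity in bits
           (8 yields the levels /0, /8, /16, /24, /32).
    """
    totals = defaultdict(int)
    for addr, nbytes in flows:
        for plen in range(0, 33, step):
            # strict=False masks the host bits: 10.1.2.3/8 -> 10.0.0.0/8
            prefix = ip_network(f"{addr}/{plen}", strict=False)
            totals[str(prefix)] += nbytes
    return dict(totals)


def heavy_prefixes(flows, phi, step=8):
    """Return the prefixes whose byte total is at least phi * overall total."""
    flows = list(flows)
    total = sum(nbytes for _, nbytes in flows)
    return {
        prefix: count
        for prefix, count in prefix_totals(flows, step).items()
        if count >= phi * total
    }


# Hypothetical sample traffic: (source address, bytes).
sample_flows = [
    ("10.1.1.1", 30),
    ("10.1.1.2", 30),
    ("10.1.2.1", 30),
    ("10.2.0.1", 5),
    ("192.168.0.1", 5),
]

for prefix, count in sorted(heavy_prefixes(sample_flows, phi=0.5).items()):
    print(prefix, count)
```

With these sample flows and a threshold of 50%, no individual address qualifies (each carries at most 30 of 100 bytes), yet the aggregate 10.1.1.0/24 does, which is the case described above of an aggregate being heavy while its constituent flows are not.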
Aggregations can be defined on one or more dimensions, e.g., source IP address, destination IP address, source port, destination