Online Identification of Hierarchical Heavy Hitters: Algorithms, Evaluation, and Applications

Yin Zhang⋆, Sumeet Singh§, Subhabrata Sen⋆, Nick Duffield⋆, Carsten Lund⋆
⋆ AT&T Labs – Research, Florham Park, NJ 07932, USA
§ CSE Department, University of California, San Diego, CA 92040, USA
{yzhang,sen,duffield,lund}@research.att.com, susingh@cs.ucsd.edu

ABSTRACT

In traffic monitoring, accounting, and network anomaly detection, it is often important to be able to detect high-volume traffic clusters in near real-time. Such heavy-hitter traffic clusters are often hierarchical (i.e., they may occur at different aggregation levels like ranges of IP addresses) and possibly multidimensional (i.e., they may involve the combination of different IP header fields like IP addresses, port numbers, and protocol). Without prior knowledge about the precise structures of such traffic clusters, a naive approach would require the monitoring system to examine all possible combinations of aggregates in order to detect the heavy hitters, which can be prohibitive in terms of computation resources. In this paper, we focus on online identification of 1-dimensional and 2-dimensional hierarchical heavy hitters (HHHs), arguably the two most important scenarios in traffic analysis. We show that the problem of HHH detection can be transformed to one of dynamic packet classification by taking a top-down approach and adaptively creating new rules to match HHHs. We then adapt several existing static packet classification algorithms to support dynamic packet classification. The resulting HHH detection algorithms have much lower worst-case update costs than existing algorithms and can provide tunable deterministic accuracy guarantees. As an application of these algorithms, we also propose robust techniques to detect changes among heavy-hitter traffic clusters. Our techniques can accommodate variability due to sampling that is increasingly used in network measurement.
Evaluation based on real Internet traces collected at a Tier-1 ISP suggests that these techniques are remarkably accurate and efficient.

Categories and Subject Descriptors
C.2.3 [Computer-Communications Networks]: Network Operations: Network Monitoring, Network Management

General Terms
Measurement, Algorithms

Keywords
Network Anomaly Detection, Data Stream Computation, Hierarchical Heavy Hitters, Change Detection, Packet Classification

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IMC’04, October 25–27, 2004, Taormina, Sicily, Italy.
Copyright 2004 ACM 1-58113-821-0/04/0010 ...$5.00.

1. INTRODUCTION

1.1 Motivation and background

The Internet has emerged as a critical communication infrastructure, carrying traffic for a wide range of important scientific, business and consumer applications. Network service providers and enterprise network operators need the ability to detect anomalous events in the network, for network management and monitoring, reliability, security and performance reasons. While some traffic anomalies are relatively benign and tolerable, others can be symptomatic of potentially serious problems such as performance bottlenecks due to flash crowds [24], network element failures, malicious activities such as denial of service attacks (DoS) [23], and worm propagation [28]. It is therefore very important to be able to detect traffic anomalies accurately and in near real-time, to enable timely initiation of appropriate mitigation steps. This paper focuses on streaming techniques for enabling accurate, near real-time detection of anomalies in IP network traffic data.
A major challenge for anomaly detection is that traffic anomalies often have very complicated structures: they are often hierarchical (i.e., they may occur at arbitrary aggregation levels like ranges of IP addresses and port numbers) and sometimes also multidimensional (i.e., they can only be exposed when we examine traffic with specific combinations of IP address ranges, port numbers, and protocol). In order to identify such multidimensional hierarchical traffic anomalies, a naive approach would require the monitoring system to examine all possible combinations of aggregates, which can be prohibitive even for just two dimensions.

Another challenge is the need to process massive streams of traffic data online and in near real-time. Given today’s traffic volume and link speeds, the input data stream can easily contain millions of concurrent flows or more, so it is often infeasible or too expensive to maintain per-flow state.

1.2 Heavy hitters, aggregation and hierarchies

A very useful concept in identifying dominant or unusual traffic patterns is that of hierarchical heavy hitters (HHHs) [11]. A heavy hitter is an entity that accounts for at least a specified proportion of the total activity, measured in terms of number of packets, bytes, connections, etc. A heavy hitter could correspond to an individual flow or connection. It could also be an aggregation of multiple flows/connections that share some common property, but which themselves may not be heavy hitters.

Of particular interest to our application is the notion of hierarchical aggregation. IP addresses can be organized into a hierarchy according to prefix. The challenge for hierarchical aggregation is to efficiently compute the total activity of all traffic matching relevant prefixes. A hierarchical heavy hitter is a hierarchical aggregate that accounts for some specified proportion of the total activity.
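
As a rough illustration of hierarchical aggregation by prefix, the following Python sketch sums traffic volume over every enclosing IPv4 prefix and reports the prefixes that reach a given fraction of the total. It is a naive offline computation that keeps one counter per observed prefix, which is exactly the kind of exhaustive state an online monitor cannot afford, and it is not the algorithm proposed in this paper. The function name `prefix_heavy_hitters`, the threshold parameter `phi`, the byte-aligned prefix lengths, and the sample flows are all assumptions made for illustration.

```python
import ipaddress
from collections import Counter


def prefix_heavy_hitters(flows, phi, prefix_lengths=(8, 16, 24, 32)):
    """Return {prefix: volume} for prefixes carrying >= phi of all traffic.

    flows: iterable of (IPv4 address string, volume) pairs (hypothetical input).
    phi: fraction of the total volume a prefix must reach (assumed name).
    prefix_lengths: hierarchy levels to aggregate at (assumed byte-aligned).
    """
    totals = Counter()
    grand_total = 0
    for addr, volume in flows:
        ip = int(ipaddress.IPv4Address(addr))
        grand_total += volume
        # Credit the volume to every ancestor prefix of this address.
        for plen in prefix_lengths:
            mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
            totals[(ip & mask, plen)] += volume

    threshold = phi * grand_total
    return {
        str(ipaddress.IPv4Address(net)) + "/" + str(plen): vol
        for (net, plen), vol in totals.items()
        if vol >= threshold
    }


# Hypothetical sample: no single address reaches 50% of the traffic,
# but several of the prefixes that contain them do.
sample_flows = [
    ("10.1.1.1", 30),
    ("10.1.1.2", 30),
    ("10.1.2.3", 25),
    ("192.168.0.1", 15),
]
heavy = prefix_heavy_hitters(sample_flows, phi=0.5)
for prefix, volume in sorted(heavy.items()):
    print(prefix, volume)
```

With these sample flows, the individual addresses stay below the threshold while 10.1.1.0/24 and its ancestors exceed it, which matches the observation above that an aggregate can be a heavy hitter even when none of its constituent flows is.
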
Aggregations can be defined on one or more dimensions, e.g., source IP address, destination IP address, source port, destination