DHS: Adaptive Memory Layout Organization of Sketch Slots for
Fast and Accurate Data Stream Processing
Bohan Zhao, Xiang Li, Boyu Tian, Zhiyu Mei, and Wenfei Wu∗
Tsinghua University
ABSTRACT
Data stream processing is a crucial computation task in data mining
applications. The rigid and fixed data structures in existing
solutions limit their accuracy, throughput, and generality in
measurement tasks. We propose Dynamic Hierarchical Sketch (DHS),
a sketch-based hybrid solution targeting these properties. During
online stream processing, DHS hashes items to buckets and
organizes the cells in each bucket dynamically; the sizes of the cells in a
bucket are adjusted adaptively to the actual size and distribution of
flows. Thus, memory is efficiently used to precisely record elephant
flows and to cover more mice flows. Implementation and evaluation
show that DHS achieves high accuracy, high throughput, and high
generality on five measurement tasks: flow size estimation, flow
size distribution estimation, heavy hitter detection, heavy changer
detection, and entropy estimation.
CCS CONCEPTS
· Information systems → Data stream mining.
KEYWORDS
Data stream processing; Approximate frequency estimation; Sketch
ACM Reference Format:
Bohan Zhao, Xiang Li, Boyu Tian, Zhiyu Mei, and Wenfei Wu. 2021. DHS:
Adaptive Memory Layout Organization of Sketch Slots for Fast and Accurate
Data Stream Processing. In Proceedings of the 27th ACM SIGKDD Conference
on Knowledge Discovery and Data Mining (KDD ’21), August 14–18, 2021,
Virtual Event, Singapore. ACM, New York, NY, USA, 9 pages. https://doi.org/
10.1145/3447548.3467353
1 INTRODUCTION
Massive data stream processing is a basic computation scheme in
various applications such as network monitoring [13, 16, 19], sensor
management [1], stock tickers [4], recommendation systems [11],
and anomaly detection [29]. A data stream consists of a sequence of
data items, each of which has an ID. A flow represents the items with
the same ID. The data stream processing algorithm takes in the
stream and outputs statistical results at the flow level (e.g., distribution,
entropy). In the applications above, data are generated at a high
∗Wenfei Wu is the corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
KDD ’21, August 14–18, 2021, Virtual Event, Singapore
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8332-5/21/08. . . $15.00
https://doi.org/10.1145/3447548.3467353
rate, and long-term, high-volume storage is either unaffordable or
costs more than it benefits. Thus, data stream processing algorithms
store data temporarily in limited memory, process each item in one
pass, and keep the statistics in their data structures for periodic
or final queries.
The data stream processing in various applications can be abstracted
into five typical measurement tasks: (1) flow size estimation,
(2) flow size distribution estimation, (3) heavy hitter detection, (4)
heavy changer detection, and (5) entropy estimation. For example,
on software platforms, electing the hottest data flow (e.g., a web
clickstream) helps recommendation systems make decisions [11]
and guarantees the quality of service [30]; filtering out heavy hitters
or heavy changers contributes to identifying attackers in DDoS
defense [6, 12]; and the frequency distribution or entropy of a data stream
reflects the system state and can be applied to mine anomalies [29].
On hardware platforms such as programmable switches, the accuracy
of finding top-k flows is critical to traffic offloading [15], and
flow counting is a common component of network functions and
their hardware implementations [9].
The stream processing structure (i.e., the data structure and its
read/write methods) plays a key role in the performance of these tasks.
It should provide an estimation of flow size, which can be further
used for the measurement tasks. Various solutions for stream processing
have been proposed in the past two decades, and each of
them focuses on specific measurement tasks and has its own performance
emphasis (memory efficiency, accuracy, or throughput). Existing
solutions can be classified into three categories: sketch-based solutions,
counter-based solutions, and hybrid solutions. Sketch-based
solutions provide fuzzy information about all flows with high throughput
(count-min sketch [7], CU-sketch [10], reversible sketch [23],
and ASketch [22]). Counter-based solutions record accurate information
about elephant flows with poor throughput (space-saving [20],
lossy counting [18], and unbiased space-saving [25]). Hybrid
solutions combine key ideas of the two to trade off throughput
against accuracy and to support more measurement tasks well
(HeavyGuardian [26], Cold Filter [31], ElasticSketch [27],
HeavyKeeper [28], WaveSketch [14]).
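To make the sketch-based category concrete, the following is a minimal count-min sketch in Python: d rows of w counters, one hashed counter per row per item, with the row-wise minimum as the estimate. The class name, parameters, and hashing scheme are our own illustrative choices, not code from any cited system.

```python
import random

class CountMinSketch:
    """Minimal count-min sketch: collisions only inflate counters,
    so the row-wise minimum is an upper bound on the true frequency."""

    def __init__(self, width=1024, depth=4, seed=42):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        # One independent hash salt per row.
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Per-row slot index for the item (salted built-in hash).
        return hash((self.salts[row], item)) % self.width

    def update(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._index(r, item)] += count

    def query(self, item):
        # Minimum over rows: never underestimates the true count.
        return min(self.table[r][self._index(r, item)]
                   for r in range(self.depth))

cms = CountMinSketch()
for _ in range(100):
    cms.update("flow_a")
cms.update("flow_b", 7)
assert cms.query("flow_a") >= 100  # one-sided error: estimate >= truth
```

This one-pass, bounded-memory behavior is what gives sketches their throughput; the cost is that every counter is shared, so small flows inherit error from large ones, which is exactly the rigidity that hybrid and adaptive designs target.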
We observe that all existing solutions organize the basic counting
unit (sketch/counter slots) in a rigid and fixed layout. We propose
to organize the slot size dynamically and adaptively to the actual
flow sizes and distribution. Thus, the limited memory can be used
more efficiently, which further improves the quality of flow size
measurement. Essentially, three factors affect the measurement
quality: flow coverage (for tasks 2, 4, and 5), size estimation of all
flows (for tasks 2 and 5), and size estimation of elephant flows (for
tasks 1 and 3). If the slot size can dynamically adapt to each flow ID
and its frequency, high-frequency flows can be accurately preserved
and the remaining memory can host more low-frequency flows, so
all three factors can be improved.
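As an illustration of the adaptive-layout idea (not the published DHS algorithm), the sketch below keeps a fixed bit budget per bucket, widens a cell when its counter overflows, and evicts the smallest other cell when the budget is exceeded; the class name, budget, initial width, and eviction policy are all assumptions made for exposition.

```python
BUCKET_BITS = 64  # assumed fixed bit budget per bucket

class AdaptiveBucket:
    """Illustrative adaptive cell sizing: a bucket stores
    [flow_id, count, width] cells under a fixed bit budget.
    Elephant flows earn wider cells; mice keep narrow ones."""

    def __init__(self, bits=BUCKET_BITS, init_width=4):
        self.bits = bits
        self.init_width = init_width
        self.cells = []  # list of [flow_id, count, width_in_bits]

    def _used_bits(self):
        return sum(w for _, _, w in self.cells)

    def insert(self, flow_id):
        for cell in self.cells:
            if cell[0] == flow_id:
                cell[1] += 1
                # Widen the cell whenever its counter overflows it.
                while cell[1] >= (1 << cell[2]):
                    cell[2] += 1
                    # Stay within budget by evicting the smallest
                    # other cell (a simplistic policy for exposition).
                    while self._used_bits() > self.bits and len(self.cells) > 1:
                        victim = min((c for c in self.cells if c is not cell),
                                     key=lambda c: c[1])
                        self.cells.remove(victim)
                return
        # New flow: admit it only if a fresh narrow cell fits.
        if self._used_bits() + self.init_width <= self.bits:
            self.cells.append([flow_id, 1, self.init_width])

    def query(self, flow_id):
        for fid, count, _ in self.cells:
            if fid == flow_id:
                return count
        return 0

bucket = AdaptiveBucket()
for _ in range(1000):
    bucket.insert("elephant")
bucket.insert("mouse")
assert bucket.query("elephant") == 1000  # wide cell records it exactly
```

The point of the sketch is the layout decision: the same 64 bits that a fixed design would split into equal slots are here reallocated on the fly, so an elephant flow gets an exact wide counter while leftover bits still admit mice flows.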