DHS: Adaptive Memory Layout Organization of Sketch Slots for Fast and Accurate Data Stream Processing

Bohan Zhao, Xiang Li, Boyu Tian, Zhiyu Mei, and Wenfei Wu
Tsinghua University

ABSTRACT

Data stream processing is a crucial computation task in data mining applications. The rigid and fixed data structures in existing solutions limit their accuracy, throughput, and generality in measurement tasks. We propose Dynamic Hierarchical Sketch (DHS), a sketch-based hybrid solution targeting these properties. During online stream processing, DHS hashes items to buckets and organizes the cells in each bucket dynamically; the sizes of all cells in a bucket are adjusted adaptively to the actual size and distribution of flows. Thus, memory is efficiently used to precisely record elephant flows and cover more mice flows. Implementation and evaluation show that DHS achieves high accuracy, high throughput, and high generality on five measurement tasks: flow size estimation, flow size distribution estimation, heavy hitter detection, heavy changer detection, and entropy estimation.

CCS CONCEPTS

• Information systems → Data stream mining.

KEYWORDS

Data stream processing; Approximate frequency estimation; Sketch

ACM Reference Format:
Bohan Zhao, Xiang Li, Boyu Tian, Zhiyu Mei, and Wenfei Wu. 2021. DHS: Adaptive Memory Layout Organization of Sketch Slots for Fast and Accurate Data Stream Processing. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3447548.3467353

1 INTRODUCTION

Massive data stream processing is a basic computation scheme in various applications such as network monitoring [13, 16, 19], sensor management [1], stock tickers [4], recommendation systems [11], and anomaly detection [29]. A data stream consists of a sequence of data items, each of which has an ID. A flow represents the items with the same ID.
The data stream processing algorithm takes in the stream and outputs statistical results at the flow level (e.g., distribution, entropy). In the applications above, data are generated at a high rate, and long-term, high-volume storage is either not affordable or outweighs its benefits. Thus, data stream processing algorithms store data temporarily in limited memory, process each item in one pass, and keep the statistics in their data structures for periodic or final queries.

Wenfei Wu is the corresponding author.

KDD '21, August 14–18, 2021, Virtual Event, Singapore
© 2021 Association for Computing Machinery. ACM ISBN 978-1-4503-8332-5/21/08.

The data stream processing in various applications can be abstracted as five typical measurement tasks: (1) flow size estimation, (2) flow size distribution estimation, (3) heavy hitter detection, (4) heavy changer detection, and (5) entropy estimation. For example, on software platforms, electing the hottest data flow (e.g., a web clickstream) can help recommendation systems make decisions [11] and guarantee the quality of service [30]; filtering out heavy hitters or heavy changers contributes to identifying attackers in DDoS defense [6, 12]; and the frequency distribution or entropy of a data stream reflects the system state and can be applied to mine anomalies [29].
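To make these tasks concrete, the following toy example computes three of the statistics (flow sizes, heavy hitters, and entropy) from exact per-flow counts. This is only an illustration of what the tasks ask for; the stream and the 25% heavy-hitter threshold are made up, and real stream processors must approximate these quantities in limited memory rather than keep exact counts.

```python
import math
from collections import Counter

# Toy stream of item IDs; each distinct ID is one flow.
stream = ["a", "b", "a", "c", "a", "b", "d", "a"]

# Task 1: flow size estimation (exact here).
counts = Counter(stream)

# Task 3: heavy hitter detection -- flows whose size exceeds
# a fraction (here 25%, an arbitrary choice) of the total traffic.
total = sum(counts.values())
heavy_hitters = {f for f, c in counts.items() if c >= 0.25 * total}

# Task 5: empirical entropy of the flow size distribution,
# H = -sum over flows of (c/total) * log2(c/total).
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
```

With this stream, flow "a" has size 4 out of 8 items, "a" and "b" cross the 25% threshold, and the entropy evaluates to 1.75 bits.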
On hardware platforms such as programmable switches, the accuracy of finding top-k flows is critical to traffic offloading [15], and flow counting is a common component of network functions and their hardware implementations [9].

The stream processing structure (i.e., the data structure and its read/write methods) plays a key role in the performance of these tasks. It should provide an estimation of flow size, which can be further used for the measurement tasks. Various solutions for stream processing have been proposed in the past two decades; each of them focuses on specific measurement tasks and has its own performance emphasis (memory efficiency, accuracy, or throughput). Existing solutions can be classified into three categories: sketch-based, counter-based, and hybrid. Sketch-based solutions provide fuzzy information about all flows with high throughput (count-min sketch [7], CU-sketch [10], reversible sketch [23], and ASketch [22]). Counter-based solutions record accurate information about elephant flows with poor throughput (space-saving [20], lossy counting [18], and unbiased space-saving [25]). Hybrid solutions combine the key ideas of the two to trade off throughput against accuracy and support more measurement tasks (HeavyGuardian [26], Cold Filter [31], ElasticSketch [27], HeavyKeeper [28], WaveSketch [14]).

We observe that all existing solutions organize the basic counting unit, sketch/counter slots, in a rigid and fixed layout. We propose to organize the slot sizes dynamically and adaptively to the actual flow sizes and distribution. Thus, the limited memory can be used more efficiently, which further improves the quality of flow size measurement. Essentially, three factors affect measurement quality: flow coverage (for tasks 2, 4, and 5), all-flow size estimation (for tasks 2 and 5), and elephant flow size estimation (for tasks 1 and 3).
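As a point of reference for the sketch-based category, the classic count-min sketch [7] can be written in a few lines. This is a minimal textbook-style sketch, not code from any of the cited systems; the per-row hashing via seeded tuples is a simplification of the pairwise-independent hash families used in practice.

```python
import random

class CountMinSketch:
    """Minimal count-min sketch: d rows of w counters.

    Each item is hashed to one counter per row on update; a query takes
    the minimum over rows, so estimates can only overcount (hash
    collisions add noise) and never undercount.
    """

    def __init__(self, w=1024, d=4, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.seeds = [rng.getrandbits(32) for _ in range(d)]
        self.rows = [[0] * w for _ in range(d)]

    def _idx(self, row, key):
        # Simplified seeded hash; real implementations use stronger families.
        return hash((self.seeds[row], key)) % self.w

    def update(self, key, count=1):
        for r in range(self.d):
            self.rows[r][self._idx(r, key)] += count

    def query(self, key):
        return min(self.rows[r][self._idx(r, key)] for r in range(self.d))
```

Every flow shares the same fixed-width counters regardless of its size; this uniform, static layout is exactly the rigidity that the adaptive slot organization above is meant to remove.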
If slot sizes can dynamically adapt to flow IDs and their frequencies, high-frequency flows can be preserved precisely while the remaining memory hosts more low-frequency flows, so all three factors are preserved and improved.
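The intuition behind adaptive slot sizing can be illustrated with a toy bucket that shares a fixed bit budget among variable-width cells: a cell starts small and borrows bits from the bucket's free budget only when its counter is about to overflow. This is a deliberately simplified thought experiment under made-up parameters (a 32-bit budget, 4-bit initial cells, drop-on-full policy), not the DHS algorithm itself.

```python
class AdaptiveBucket:
    """Toy bucket sharing a fixed bit budget among variable-width cells.

    Each cell stores [count, width_bits] for one flow. A cell starts at
    init_width bits; when an increment would overflow it, the cell grows
    by one bit if the bucket still has free budget. Large flows thus end
    up with wide counters while small flows stay cheap.
    """

    def __init__(self, budget_bits=32, init_width=4):
        self.free = budget_bits
        self.init_width = init_width
        self.cells = {}  # flow_id -> [count, width_bits]

    def insert(self, flow_id):
        if flow_id not in self.cells:
            if self.free < self.init_width:
                return False  # bucket full: this toy model drops the item
            self.free -= self.init_width
            self.cells[flow_id] = [0, self.init_width]
        count, width = self.cells[flow_id]
        if count + 1 >= (1 << width):  # increment would overflow this cell
            if self.free < 1:
                return False           # cannot widen: drop the item
            self.free -= 1             # widen the cell by one bit
            self.cells[flow_id][1] = width + 1
        self.cells[flow_id][0] = count + 1
        return True

    def query(self, flow_id):
        return self.cells.get(flow_id, [0, 0])[0]
```

For example, a flow inserted 20 times starts in a 4-bit cell (which holds counts up to 15) and is widened to 5 bits on the sixteenth insertion, leaving the unused budget available for other flows. DHS additionally decides how to reorganize cells and handle contention between flows in a bucket, which this toy omits.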