The DARPA SEARCHLIGHT Dataset of Application Network Traffic

Calvin Ardi, USC/ISI, calvin@isi.edu
Connor Aubry, Sandia National Laboratories, caubry@sandia.gov
Brian Kocoloski, USC/ISI, bkocolos@isi.edu
David DeAngelis, USC/ISI, deangeli@isi.edu
Alefiya Hussain, USC/ISI, hussain@isi.edu
Matt Troglia, Sandia National Laboratories, mtrogli@sandia.gov
Stephen Schwab, USC/ISI, schwab@isi.edu

ABSTRACT
Researchers are in constant need of reliable data to develop and evaluate AI/ML methods for networks and cybersecurity. While Internet measurements can provide realistic data, such datasets lack ground truth about application flows. We present a 750 GB dataset that includes 2000 systematically conducted experiments and the resulting packet captures of video streaming, video teleconferencing, and cloud-based document editing applications. This curated and labeled dataset contains bidirectional and encrypted traffic with complete ground truth, and can be widely used to assess and evaluate AI/ML algorithms.

CCS CONCEPTS
• Networks → Application layer protocols; Network experimentation; Network measurement; • Information systems → Internet communications tools.

KEYWORDS
datasets, network experimentation, network traffic

ACM Reference Format:
Calvin Ardi, Connor Aubry, Brian Kocoloski, David DeAngelis, Alefiya Hussain, Matt Troglia, and Stephen Schwab. 2022. The DARPA SEARCHLIGHT Dataset of Application Network Traffic. In Cyber Security Experimentation and Test Workshop (CSET 2022), August 8, 2022, Virtual, CA, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3546096.3546103

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
© 2022 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-9684-4/22/08...$15.00

1 INTRODUCTION
Artificial intelligence and machine learning (AI/ML) methods are widely used to understand and develop networked and distributed systems. However, datasets for training and developing such methods are scarce, and evaluating the methods requires knowing the complete ground truth. Additionally, there are many challenges in collecting and curating datasets for network traffic. Each network has unique characteristics with inherently stochastic traffic dynamics. Network traffic can have anomalies and misbehaviors that complicate creating training datasets, and can contain sensitive personally identifiable information (PII) or intellectual property data, limiting wide access.

To help address the scarcity of publicly available networking datasets and enable networking research, we present a network traffic dataset that was systematically collected, curated, and labeled on an emulation testbed. Many researchers collect one-off network traffic datasets, draw conclusions, subject the comparisons to peer review, and publish results. While such paper-based datasets allow for inference of properties and behavior, they do not support direct assessments. This dataset was collected for the DARPA SEARCHLIGHT evaluation effort [7]. We believe that sharing this dataset will enable repeatable and directly comparative assessments of next generation networking technologies and applications in cybersecurity, traffic engineering, and network measurement.

Although the DARPA SEARCHLIGHT dataset is generated, having been collected on an emulation testbed, it is a unique resource in several ways. First, it is complete: the bidirectional traffic from all the sources and destinations is captured. Second, it is labeled: all the flows in the network traffic are identified and associated with an application.
Third, the traffic flows in this dataset have varying levels of complexity: some traffic captures have only one application flow, while others have several simultaneous applications and flows. Finally, the dataset contains multiple repeated samples to account for the stochastic and dynamic nature of network traffic. We believe that this combination of dataset features will enable a wide range of AI/ML methods for network traffic analysis to be systematically developed and evaluated.

The COVID-19 pandemic resulted in major shifts in Internet traffic composition and patterns [25]. In building the dataset, we focused on three contemporary applications (video streaming, video teleconferencing, and cloud-based services) running over both well-established protocols (TCP, UDP, HTTP) and the recently standardized QUIC. Additionally, most datasets do not make it possible to ascertain information in encrypted traffic, the volume of which has increased significantly with work-from-home and remote work [13]. We include in this dataset a large collection of traces with IPsec [10] and WireGuard [9] encryption.
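As an informal illustration of why the "complete" (bidirectional) and "labeled" properties described above matter for analysis, the following Python sketch groups packets from both directions of a flow under one direction-independent key and attributes bytes to applications through a ground-truth lookup. This is only a sketch under stated assumptions: the `Packet` record, the `flow_key` and `bytes_per_application` helpers, the addresses, and the label strings are all hypothetical and do not reflect the dataset's actual file formats or label schema.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical, simplified packet record (not the dataset's schema)."""
    src: str
    sport: int
    dst: str
    dport: int
    proto: str
    size: int


def flow_key(pkt: Packet) -> tuple:
    """Return a direction-independent key so both directions of a flow match."""
    endpoints = sorted([(pkt.src, pkt.sport), (pkt.dst, pkt.dport)])
    return (endpoints[0], endpoints[1], pkt.proto)


def bytes_per_application(packets, labels):
    """Sum packet sizes per application using a ground-truth label table."""
    totals = defaultdict(int)
    for pkt in packets:
        # Flows missing from the label table are reported as "unlabeled".
        app = labels.get(flow_key(pkt), "unlabeled")
        totals[app] += pkt.size
    return dict(totals)


# Illustrative packets: two directions of one UDP flow, plus one TCP flow.
packets = [
    Packet("10.0.0.2", 50000, "10.0.1.5", 443, "udp", 1200),
    Packet("10.0.1.5", 443, "10.0.0.2", 50000, "udp", 300),
    Packet("10.0.0.3", 40000, "10.0.1.6", 443, "tcp", 500),
]

# Hypothetical ground truth mapping each flow to the application behind it.
labels = {
    flow_key(packets[0]): "video-streaming",
    flow_key(packets[2]): "document-editing",
}

totals = bytes_per_application(packets, labels)
print(totals)
```

Because the key sorts the two endpoints, request and response packets fall into the same flow without tracking which side initiated it, which is only meaningful when both directions are present in the capture.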