The DARPA SEARCHLIGHT Dataset of Application Network Traffic

Calvin Ardi, USC/ISI, calvin@isi.edu
Connor Aubry, Sandia National Laboratories, caubry@sandia.gov
Brian Kocoloski, USC/ISI, bkocolos@isi.edu
David DeAngelis, USC/ISI, deangeli@isi.edu
Alefiya Hussain, USC/ISI, hussain@isi.edu
Matt Troglia, Sandia National Laboratories, mtrogli@sandia.gov
Stephen Schwab, USC/ISI, schwab@isi.edu

ABSTRACT
Researchers are in constant need of reliable data to develop and evaluate AI/ML methods for networks and cybersecurity. While Internet measurements can provide realistic data, such datasets lack ground truth about application flows. We present a ∼750GB dataset that includes ∼2000 systematically conducted experiments and the resulting packet captures with video streaming, video teleconferencing, and cloud-based document editing applications. This curated and labeled dataset has bidirectional and encrypted traffic with complete ground truth that can be widely used for assessments and evaluation of AI/ML algorithms.

CCS CONCEPTS
• Networks → Application layer protocols; Network experimentation; Network measurement; • Information systems → Internet communications tools.

KEYWORDS
datasets, network experimentation, network traffic

ACM Reference Format:
Calvin Ardi, Connor Aubry, Brian Kocoloski, David DeAngelis, Alefiya Hussain, Matt Troglia, and Stephen Schwab. 2022. The DARPA SEARCHLIGHT Dataset of Application Network Traffic. In Cyber Security Experimentation and Test Workshop (CSET 2022), August 8, 2022, Virtual, CA, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3546096.3546103

ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
CSET 2022, August 8, 2022, Virtual, CA, USA
© 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9684-4/22/08...$15.00
https://doi.org/10.1145/3546096.3546103

1 INTRODUCTION
Artificial intelligence and machine learning (AI/ML) methods are widely used to understand and develop networked and distributed systems. However, datasets for training and developing such methods are scarce, and evaluating the methods requires complete ground truth. Additionally, there are many challenges in collecting and curating datasets for network traffic. Each network has unique characteristics with inherently stochastic traffic dynamics. Network traffic can contain anomalies and misbehaviors that complicate the creation of training datasets, and it can contain sensitive personally identifiable information (PII) or intellectual property, which limits wide access.

To help address the scarcity of publicly available networking datasets and enable networking research, we present a network traffic dataset that was systematically collected, curated, and labeled on an emulation testbed. Many researchers collect one-off network traffic datasets, draw conclusions, subject the comparisons to peer review, and publish results. While such paper-based datasets allow for inference of properties and behavior, they do not support direct assessments. This dataset was collected for the DARPA SEARCHLIGHT evaluation effort [7]. We believe that sharing this dataset will enable repeatable and directly comparative assessments of next generation networking technologies and applications in cybersecurity, traffic engineering, and network measurement.

Although the DARPA SEARCHLIGHT dataset is generated, in that it was collected on an emulation testbed, it is a unique resource in several ways. First, it is complete: the bidirectional traffic from all the sources and destinations is captured. Second, it is labeled: all the flows in the network traffic are identified and associated with an application.
Third, the traffic flows in this dataset have varying levels of complexity: some traffic captures have only one application flow, while others have several simultaneous applications and flows. Finally, the dataset contains multiple repeated samples to account for the stochastic and dynamic nature of network traffic. We believe that this combination of dataset features will enable a wide range of AI/ML methods for network traffic analysis to be systematically developed and evaluated.

The COVID-19 pandemic resulted in major shifts in Internet traffic composition and patterns [25]. In building the dataset, we focused on three contemporary traffic applications: video streaming, video teleconferencing, and cloud-based services, carried over both well-established protocols (TCP, UDP, HTTP) and the recently standardized QUIC. Additionally, ascertaining information in encrypted traffic, which has increased significantly with work-from-home and remote work [13], is not possible in most datasets. We include in this dataset a large collection of traces with IPsec [10] and WireGuard [9] encryption.
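
To illustrate why per-flow labels matter for direct assessment, the following Python sketch scores a classifier's predicted application for each flow against ground-truth labels. It is a minimal illustration under stated assumptions, not the dataset's tooling: the 5-tuple `FlowKey` layout, the helper names `canonical` and `accuracy`, the application label strings, and all addresses are hypothetical and do not reflect the dataset's actual label schema.

```python
from collections import namedtuple

# Hypothetical flow key: the classic 5-tuple. The dataset's real label
# format is not described in this section, so this layout is an assumption.
FlowKey = namedtuple("FlowKey", "src_ip src_port dst_ip dst_port proto")


def canonical(key):
    """Return a direction-independent key so that both directions of a
    bidirectional flow map to the same ground-truth label."""
    a = (key.src_ip, key.src_port)
    b = (key.dst_ip, key.dst_port)
    lo, hi = sorted([a, b])
    return FlowKey(lo[0], lo[1], hi[0], hi[1], key.proto)


def accuracy(ground_truth, predictions):
    """Fraction of labeled flows whose predicted application matches."""
    truth = {canonical(k): app for k, app in ground_truth.items()}
    preds = {canonical(k): app for k, app in predictions.items()}
    if not truth:
        raise ValueError("no labeled flows")
    correct = sum(1 for k, app in truth.items() if preds.get(k) == app)
    return correct / len(truth)


# Illustrative values only; none of these come from the dataset.
ground_truth = {
    FlowKey("10.0.0.2", 50000, "10.0.1.5", 443, "tcp"): "video-streaming",
    FlowKey("10.0.0.3", 50001, "10.0.1.6", 443, "udp"): "teleconferencing",
}
predictions = {
    # The first flow again, but observed in the reverse direction.
    FlowKey("10.0.1.5", 443, "10.0.0.2", 50000, "tcp"): "video-streaming",
    FlowKey("10.0.0.3", 50001, "10.0.1.6", 443, "udp"): "document-editing",
}
score = accuracy(ground_truth, predictions)
```

Normalizing the key by direction reflects the bidirectional nature of the captures: a prediction made on the reverse direction of a flow is still credited to the same labeled flow. Because every flow carries a label, the resulting score is computed over all flows in a capture rather than over a hand-labeled subset.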