The DARPA SEARCHLIGHT Dataset of Application Network
Traffic
Calvin Ardi
USC/ISI
calvin@isi.edu
Connor Aubry
Sandia National Laboratories
caubry@sandia.gov
Brian Kocoloski
USC/ISI
bkocolos@isi.edu
David DeAngelis
USC/ISI
deangeli@isi.edu
Alefiya Hussain
USC/ISI
hussain@isi.edu
Matt Troglia
Sandia National Laboratories
mtrogli@sandia.gov
Stephen Schwab
USC/ISI
schwab@isi.edu
ABSTRACT
Researchers are in constant need of reliable data to develop and
evaluate AI/ML methods for networks and cybersecurity. While
Internet measurements can provide realistic data, such datasets
lack ground truth about application flows. We present a ∼750 GB
dataset that includes ∼2000 systematically conducted experiments
and the resulting packet captures with video streaming, video tele-
conferencing, and cloud-based document editing applications. This
curated and labeled dataset has bidirectional and encrypted traffic
with complete ground truth that can be widely used for assessments
and evaluation of AI/ML algorithms.
CCS CONCEPTS
• Networks → Application layer protocols; Network experimentation;
Network measurement; • Information systems → Internet
communications tools.
KEYWORDS
datasets, network experimentation, network traffic
ACM Reference Format:
Calvin Ardi, Connor Aubry, Brian Kocoloski, David DeAngelis, Alefiya Hus-
sain, Matt Troglia, and Stephen Schwab. 2022. The DARPA SEARCHLIGHT
Dataset of Application Network Traffic. In Cyber Security Experimentation
and Test Workshop (CSET 2022), August 8, 2022, Virtual, CA, USA. ACM, New
York, NY, USA, 6 pages. https://doi.org/10.1145/3546096.3546103
Publication rights licensed to ACM. ACM acknowledges that this contribution was
authored or co-authored by an employee, contractor or affiliate of the United States
government. As such, the Government retains a nonexclusive, royalty-free right to
publish or reproduce this article, or to allow others to do so, for Government purposes
only.
CSET 2022, August 8, 2022, Virtual, CA, USA
© 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9684-4/22/08...$15.00
https://doi.org/10.1145/3546096.3546103
1 INTRODUCTION
Artificial intelligence and machine learning (AI/ML) methods are
widely used to understand and develop networked and distributed
systems. However, datasets for training and developing such methods
are scarce, and evaluating the methods requires complete ground
truth. Additionally, there are many challenges in collecting
and curating datasets for network traffic. Each network has unique
characteristics with inherently stochastic traffic dynamics. Network
traffic can have anomalies and misbehaviors that complicate creating
training datasets, and it can contain sensitive personally identifiable
information (PII) or intellectual property data, limiting wide access.
To help address the scarcity of publicly available networking
datasets and enable networking research, we present a network
traffic dataset that was systematically collected, curated, and la-
beled on an emulation testbed. Many researchers collect one-off
network traffic datasets, draw conclusions, subject the comparisons
to peer review, and publish results. While such paper-based datasets
allow for inference of properties and behavior, they do not sup-
port direct assessments. This dataset was collected for the DARPA
SEARCHLIGHT evaluation effort [7]. We believe that sharing this
dataset will enable repeatable and directly comparative assessments
of next generation networking technologies and applications in
cybersecurity, traffic engineering, and network measurement.
The DARPA SEARCHLIGHT dataset, although generated and
collected on an emulation testbed, is a unique resource in several
ways. First, it is complete: the bidirectional traffic from all the
sources and destinations is captured. Second, it is labeled: all the
flows in the network traffic are identified and associated with an
application. Third, the traffic flows in this dataset have varying levels
of complexity. Some traffic captures have only one application flow
while some traffic captures have several simultaneous applications
and flows. Finally, the dataset contains multiple repeated samples to
account for the stochastic and dynamic nature of network traffic.
We believe that this combination of dataset features will enable a
wide range of AI/ML methods for network traffic analysis to be
systematically developed and evaluated.
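As a minimal sketch of how bidirectional, labeled flows can be used, the Python example below maps both directions of a flow to a single direction-independent key and sums bytes per application label. The Packet fields, the label dictionary, the addresses, and the application names are hypothetical choices made for illustration. They are not the dataset's actual schema or label format.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Minimal packet summary; fields are illustrative, not the dataset's schema."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str
    length: int


def flow_key(pkt: Packet) -> tuple:
    """Return a direction-independent key so both directions map to one flow."""
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    first, second = sorted((a, b))
    return (first, second, pkt.proto)


def bytes_per_application(packets, labels):
    """Sum packet bytes per application label; unlabeled flows count as 'unknown'."""
    totals = defaultdict(int)
    for pkt in packets:
        app = labels.get(flow_key(pkt), "unknown")
        totals[app] += pkt.length
    return dict(totals)


# Hypothetical example: one labeled flow seen in both directions, one unlabeled flow.
packets = [
    Packet("10.0.0.2", 51000, "10.0.1.5", 443, "udp", 1200),
    Packet("10.0.1.5", 443, "10.0.0.2", 51000, "udp", 300),
    Packet("10.0.0.3", 40000, "10.0.1.9", 80, "tcp", 60),
]
labels = {flow_key(packets[0]): "video-streaming"}
totals = bytes_per_application(packets, labels)
print(totals)
```

With complete ground truth, every flow would have a label, so the "unknown" bucket in this sketch would stay empty. On unlabeled Internet measurements it would hold most of the traffic.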
The COVID-19 pandemic resulted in major shifts in Internet
traffic composition and patterns [25]. In building the dataset, we
focused on three contemporary traffic applications: video
streaming, video teleconferencing, and cloud-based services, over both
well-established protocols (TCP, UDP, HTTP) and the
recently standardized QUIC. Additionally, ascertaining information
in encrypted traffic, which has increased significantly with work-
from-home and remote work [13], is not possible in most datasets.
We include in this dataset a large collection of traces with IPsec [10]
and WireGuard [9] encryption.