Exploring the Use of Autoencoders for Botnets Traffic Representation
Ruggiero Dargenio
CSAIL
MIT
dargenio@mit.edu
Shashank Srikant
CSAIL
MIT
shash@csail.mit.edu
Erik Hemberg
CSAIL
MIT
hembergerik@csail.mit.edu
Una-May O’Reilly
CSAIL
MIT
unamay@csail.mit.edu
Abstract—Botnets are a significant threat to cyber security.
Compromised, i.e. malicious, hosts in a network have, of late,
been detected by machine learning from hand-crafted features
sourced directly from different types of network logs. Our
interest is in automating feature engineering while examining
flow data from hosts labeled as malicious or benign. Automatically
expressing the full temporal character and dependencies of
flow data requires time windowing and a very high dimensional
set of features, in our case 30,000. To reduce dimensionality,
we generate a lower dimensional embedding (64 dimensions)
via autoencoding, which improves detection. We next increase
the volume of flows originating from the labeled hosts in our
dataset by injecting noise mixed in from background traffic.
The resulting lower signal-to-noise ratio makes the presence
of a bot even more challenging to detect, so we turn to a filter
encoder or an off-the-shelf denoising autoencoder. Both the
filter encoder and the denoising autoencoder improve detection
compared to hand-crafted features and perform comparably to
the plain autoencoder.
1. Introduction
Botnets are a critical threat to cyber security [10]. A
botnet consists of malicious software copied onto many
internet-connected devices. Each compromised device is
typically controlled from a central system to perform large-scale,
distributed attacks [7].
Recent works describe a number of approaches to detecting
botnets [4], [8], [10]. These systems typically reference
network activity: they analyze information present in
packets or flow logs and learn a malicious-host detection
model. Features are chosen by hand and consist mostly
of information logged by standard network routers. Advanced
feature engineering that better expresses the information
present in logs is uncommon and requires significant human
expertise and effort. Despite the cost, it promises to
improve the detection of malicious hosts, particularly when
the network traffic directed by the malicious software, i.e.
the signal, is deeply hidden in the background, i.e. the noise, of its
host’s normal network communications. Automated feature
engineering methods that support detection in the face of low
signal-to-noise ratios are our central interest.
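To make the low signal-to-noise setting concrete, the sketch below dilutes a labeled host's flows with background flows sampled from the same capture. This is an illustrative assumption of how such mixing could be carried out; the flow representation and the 2:1 noise ratio are hypothetical choices, not details taken from this work.

```python
import random

# Illustrative sketch only: flows are opaque records, and the default
# noise_ratio of 2.0 is a hypothetical choice, not a value from this work.
def make_noisy_host(host_flows, background_flows, noise_ratio=2.0, seed=0):
    """Dilute a host's labeled flows with sampled background flows."""
    rng = random.Random(seed)
    k = min(int(len(host_flows) * noise_ratio), len(background_flows))
    noise = rng.sample(background_flows, k)
    mixed = host_flows + noise
    rng.shuffle(mixed)  # interleave, as real captured traffic would be
    return mixed
```

A detector trained on such mixed traffic must find the bot's flows among roughly twice as many background flows, which is the harder setting the denoising approaches target.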
The advent of deep learning systems has significantly
reduced feature engineering efforts in computer vision and
natural language processing (NLP) [11], [14]. Given just
an objective function and input data, these systems can
learn reduced dimensionality representations that efficiently
support supervised learning. In a majority of common vision
and NLP tasks, these learned representations have been
shown to clearly outperform models built on traditional,
hand-engineered features.
Encouraged by such positive results, we investigate how
to likewise use deep learning to obtain reduced dimensionality
representations of network data that can support the
detection of botnets. This automation relieves the burden
of hand-crafting features. While we demonstrate that an
autoencoder can improve detection, our primary motivation
is not to outperform state-of-the-art detection systems but
to develop a methodology that can handle traffic with a lower
bot-to-background, i.e. signal-to-noise, ratio. Therefore,
in this work, we propose methods that hold
irrespective of the specific detection models and
datasets one would use to build and test detection systems.
We explore two related questions: a) How can we automatically
learn features that support the accurate detection
of malicious hosts? b) How can we model “noisy” hosts
whose background traffic resembles real-world conditions,
and ensure that such automated feature learning systems still
support detection well in these noisy environments?
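As a rough, toy-scale illustration of question (a), the sketch below trains a single-hidden-layer autoencoder with plain gradient descent to compress a feature vector into a low-dimensional embedding, standing in for the 30,000-to-64 reduction described in the abstract. The architecture, sizes, and training procedure here are illustrative assumptions, not the actual model used in this work.

```python
import numpy as np

# Toy autoencoder: encode to a small hidden layer, decode back, and
# minimize reconstruction error. Dimensions are toy-sized stand-ins.
class TinyAutoencoder:
    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_in, n_hidden))
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_in))

    def encode(self, X):
        return np.tanh(X @ self.W1)          # low-dimensional embedding

    def decode(self, H):
        return H @ self.W2                   # linear reconstruction

    def train_step(self, X, lr=0.05):
        H = self.encode(X)
        R = self.decode(H)
        err = R - X                          # reconstruction error
        gW2 = H.T @ err                      # backprop through decoder
        gH = err @ self.W2.T * (1 - H ** 2)  # tanh derivative
        gW1 = X.T @ gH                       # backprop through encoder
        self.W1 -= lr * gW1 / len(X)
        self.W2 -= lr * gW2 / len(X)
        return float(np.mean(err ** 2))
```

After training, `encode` yields the reduced representation on which a downstream detector could be trained in place of hand-crafted features.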
Specifically, our work makes the following contributions:
• We use flow logs from the CTU-13 dataset [1] and
show that organizing their features into time-windows
improves a baseline detection model. We utilize this
organization when learning features to detect botnets.
• We design autoencoders to learn feature spaces from
temporally ordered flow data to better detect botnets. To
the best of the authors’ knowledge, this is the first work
to demonstrate how traditional flow log-based features
can be improved upon by using autoencoding.
• In an attempt to encourage the community to build
more robust detectors, we also suggest a way to model
noisier host traffic conditions by exploiting the otherwise
ignored background traffic within this dataset.
• We present an encoder architecture, named filter en-
coders, which extracts features from a noisy model
of host traffic. Detectors trained on these extracted
2018 IEEE Symposium on Security and Privacy Workshops
© 2018, Ruggiero Dargenio. Under license to IEEE.
DOI 10.1109/SPW.2018.00017