Exploring the Use of Autoencoders for Botnets Traffic Representation

Ruggiero Dargenio, CSAIL MIT (dargenio@mit.edu)
Shashank Srikant, CSAIL MIT (shash@csail.mit.edu)
Erik Hemberg, CSAIL MIT (hembergerik@csail.mit.edu)
Una-May O’Reilly, CSAIL MIT (unamay@csail.mit.edu)

Abstract—Botnets are a significant threat to cyber security. Compromised, i.e. malicious, hosts in a network have, of late, been detected by applying machine learning to hand-crafted features sourced from different types of network logs. Our interest is in automating this feature engineering while examining flow data from hosts labeled as malicious or benign. Automatically expressing the full temporal character and dependencies of flow data requires time windowing and a very high-dimensional feature set, in our case 30,000 features. To reduce dimensionality, we generate a lower-dimensional embedding (64 dimensions) via autoencoding. This improves detection. We next increase the volume of flows originating from the labeled hosts in our dataset by injecting noise mixed in from background traffic. The resulting lower (metaphorical) signal-to-noise ratio makes the presence of a bot even more challenging to detect, so we resort to a filter encoder or an off-the-shelf denoising autoencoder. Both the filter encoder and the denoising autoencoder improve upon detection compared to hand-crafted features, and are comparable in performance to the autoencoder.

1. Introduction

Botnets are a critical threat to cyber security [10]. A botnet consists of malicious software copied onto many different devices, all of which are connected to the internet. Each compromised device is typically controlled from a central system to perform large-scale, distributed attacks [7].

Recent works describe a number of approaches to detecting botnets [4], [8], [10]. Typically these systems reference network activity: they analyze information present in packet or flow logs and learn a malicious-host detection model.
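The time-windowed flow representation mentioned in the abstract can be sketched minimally as follows. This is an illustrative sketch, not the paper's exact pipeline: the window length, window count, and the choice of aggregated fields (duration, bytes, packets) are assumptions made for demonstration; concatenating per-window aggregates is what drives the dimensionality up toward figures like the 30,000 features cited above.

```python
import numpy as np

def windowed_features(flows, window_s=60.0, n_windows=10):
    """Aggregate one host's flow records into fixed time windows.

    `flows` is an array of shape (n_flows, 1 + n_fields): a start-time
    column followed by numeric flow fields (e.g. duration, bytes,
    packets).  Fields are summed per window, and the per-window sums
    are concatenated into one vector, so the feature dimensionality
    scales with the number of windows.
    """
    n_fields = flows.shape[1] - 1
    t0 = flows[:, 0].min()
    out = np.zeros((n_windows, n_fields))
    idx = ((flows[:, 0] - t0) // window_s).astype(int)
    for w, row in zip(idx, flows[:, 1:]):
        if w < n_windows:          # flows past the last window are dropped
            out[w] += row
    return out.ravel()
```

For example, three flows with fields (duration, bytes, packets) and 10 windows of 60 s yield a 30-dimensional vector; with finer windows and richer per-flow fields the same construction reaches tens of thousands of dimensions.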
Features are chosen by hand and consist mostly of information logged by standard network routers. Advanced feature engineering to better express the information present in logs is uncommon and requires significant human expertise and effort. Despite the cost, it promises to improve the detection of malicious hosts, particularly when the network traffic directed by the malicious software, i.e. the signal, is deeply hidden in the background, i.e. the noise, of its host's normal network communications. Automated feature engineering methods that support detection in the face of low signal-to-noise ratios are our central interest.

The advent of deep learning systems has significantly reduced feature engineering efforts in computer vision and natural language processing (NLP) [11], [14]. Given just an objective function and input data, these systems can learn reduced-dimensionality representations that efficiently support supervised learning. In a majority of common vision and NLP tasks, these learned representations have been shown to clearly outperform models built on traditional, hand-engineered features.

Encouraged by such positive results, we investigate how to likewise use deep learning to obtain reduced-dimensionality representations of network data that can support the detection of botnets. This automation will relieve the burden of hand-crafting features. While we demonstrate that an autoencoder can improve detection, our primary motivation is not to outperform state-of-the-art detection systems but to develop a methodology that can handle traffic with a lower bot-to-background, i.e. signal-to-noise, ratio. Therefore, in this work, we propose methods which hold irrespective of the specific detection models and datasets one would use to build and test detection systems. We explore two related questions: a) How can we automatically learn features which support the accurate detection of malicious hosts?
b) How can we model "noisy" hosts that have background traffic similar to real-world conditions, and ensure such automated feature learning systems support detection well even in such noisy environments?

Specifically, our work makes the following contributions:

- We use flow logs from the CTU-13 dataset [1] and show that organizing their features into time windows improves a baseline detection model. We utilize this organization when learning features to detect botnets.
- We design autoencoders to learn feature spaces from temporally ordered flow data to better detect botnets. To the best of the authors' knowledge, this is the first work to demonstrate how traditional flow log-based features can be improved upon by autoencoding.
- In an attempt to encourage the community to build more robust detectors, we also suggest a way to model noisier host traffic conditions by exploiting the ignored background traffic within this dataset.
- We present an encoder architecture, named filter encoders, which extracts features from a noisy model of host traffic. Detectors trained on these extracted

2018 IEEE Symposium on Security and Privacy Workshops. © 2018, Ruggiero Dargenio. Under license to IEEE. DOI 10.1109/SPW.2018.00017
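The autoencoding step at the heart of these contributions can be sketched as follows. This is a minimal denoising autoencoder in plain NumPy, offered purely as an illustration of the mechanism, not the paper's architecture: the toy dimensions stand in for the 30,000-to-64 reduction, and the Gaussian corruption stands in for the injected background traffic.

```python
import numpy as np

rng = np.random.default_rng(0)

class DenoisingAutoencoder:
    """One-hidden-layer denoising autoencoder (illustrative sketch).

    The encoder maps a high-dimensional flow vector to a compact
    embedding; training corrupts the input and asks the decoder to
    reconstruct the clean vector, encouraging noise-robust features.
    """

    def __init__(self, d_in, d_hid, lr=0.01):
        self.W1 = rng.normal(0.0, 0.1, (d_in, d_hid))   # encoder weights
        self.W2 = rng.normal(0.0, 0.1, (d_hid, d_in))   # decoder weights
        self.lr = lr

    def encode(self, x):
        return np.tanh(x @ self.W1)      # low-dimensional embedding

    def step(self, x, noise=0.1):
        """One gradient step on mean squared reconstruction error."""
        x_noisy = x + rng.normal(0.0, noise, x.shape)   # corrupt input
        h = np.tanh(x_noisy @ self.W1)
        x_hat = h @ self.W2              # reconstruct the *clean* input
        err = x_hat - x
        # backpropagate through the decoder, the tanh, and the encoder
        gW2 = h.T @ err
        gW1 = x_noisy.T @ ((err @ self.W2.T) * (1.0 - h * h))
        self.W2 -= self.lr * gW2 / len(x)
        self.W1 -= self.lr * gW1 / len(x)
        return float((err ** 2).mean())
```

A detector would then be trained on `encode(x)` rather than on the raw windowed vectors, which is the substitution for hand-crafted features that the contributions above describe.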