Exploring the Use of Autoencoders for Botnets Traffic Representation

Ruggiero Dargenio, CSAIL MIT (dargenio@mit.edu)
Shashank Srikant, CSAIL MIT (shash@csail.mit.edu)
Erik Hemberg, CSAIL MIT (hembergerik@csail.mit.edu)
Una-May O’Reilly, CSAIL MIT (unamay@csail.mit.edu)

Abstract—Botnets are a significant threat to cyber security. Compromised, i.e. malicious, hosts in a network have, of late, been detected by applying machine learning to hand-crafted features sourced from different types of network logs. Our interest is in automating this feature engineering while examining flow data from hosts labeled as malicious or benign. Automatically expressing the full temporal character and dependencies of flow data requires time windowing and a very high-dimensional feature set, in our case 30,000 features. To reduce dimensionality, we generate a lower-dimensional embedding (64 dimensions) via autoencoding. This improves detection. We next increase the volume of flows originating from the labeled hosts in our dataset by injecting noise mixed in from background traffic. The resulting lower (metaphorical) signal-to-noise ratio makes the presence of a bot even more challenging to detect, so we resort to a filter encoder or an off-the-shelf denoising autoencoder. Both the filter encoder and the denoising autoencoder improve upon detection compared to hand-crafted features, and are comparable in performance to the autoencoder.

1. Introduction

Botnets are a critical threat to cyber security [10]. A botnet consists of malicious software copied onto many different devices, all of which are connected to the internet. Each compromised device is typically controlled from a central system to perform large-scale, distributed attacks [7].

Recent works describe a number of approaches to detecting botnets [4], [8], [10]. Typically these systems reference network activity: they analyze information present in packet or flow logs and learn a malicious-host detection model.
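The time-windowed flow representation mentioned in the abstract can be sketched minimally as follows. This is an illustrative sketch, not the paper's exact pipeline: the window length, window count, and the choice of aggregated fields (duration, bytes, packets) are assumptions made for demonstration; concatenating per-window aggregates is what drives the dimensionality up toward figures like the 30,000 features cited above.

```python
import numpy as np

def windowed_features(flows, window_s=60.0, n_windows=10):
    """Aggregate one host's flow records into fixed time windows.

    `flows` is an array of shape (n_flows, 1 + n_fields): a start-time
    column followed by numeric flow fields (e.g. duration, bytes,
    packets).  Fields are summed per window, and the per-window sums
    are concatenated into one vector, so the feature dimensionality
    scales with the number of windows.
    """
    n_fields = flows.shape[1] - 1
    t0 = flows[:, 0].min()
    out = np.zeros((n_windows, n_fields))
    idx = ((flows[:, 0] - t0) // window_s).astype(int)
    for w, row in zip(idx, flows[:, 1:]):
        if w < n_windows:          # flows past the last window are dropped
            out[w] += row
    return out.ravel()
```

For example, three flows with fields (duration, bytes, packets) and 10 windows of 60 s yield a 30-dimensional vector; with finer windows and richer per-flow fields the same construction reaches tens of thousands of dimensions.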
Features are chosen by hand and consist mostly of information logged by standard network routers. Advanced feature engineering to better express the information present in logs is uncommon and requires significant human expertise and effort. Despite the cost, it promises to improve the detection of malicious hosts, particularly when the network traffic directed by the malicious software, i.e. the signal, is deeply hidden in the background, i.e. the noise, of its host's normal network communications. Automated feature engineering methods that support detection in the face of low signal-to-noise ratios are our central interest.

The advent of deep learning systems has significantly reduced feature engineering efforts in computer vision and natural language processing (NLP) [11], [14]. Given just an objective function and input data, these systems can learn reduced-dimensionality representations that efficiently support supervised learning. In a majority of common vision and NLP tasks, these learned representations have been shown to clearly outperform models built on traditional, hand-engineered features.

Encouraged by such positive results, we investigate how to likewise use deep learning to obtain reduced-dimensionality representations of network data that can support the detection of botnets. This automation will relieve the burden of hand-crafting features. While we demonstrate that an autoencoder can improve detection, our primary motivation is not to outperform state-of-the-art detection systems but to develop a methodology that can handle traffic with a lower bot-to-background, i.e. signal-to-noise, ratio. Therefore, in this work, we propose methods which hold irrespective of the specific detection models and datasets one would use to build and test detection systems. We explore two related questions: a) How can we automatically learn features which support the accurate detection of malicious hosts?
b) How can we model "noisy" hosts that have background traffic similar to real-world conditions, and ensure such automated feature learning systems support detection well even in such noisy environments?

Specifically, our work makes the following contributions:

- We use flow logs from the CTU-13 dataset [1] and show that organizing their features into time windows improves a baseline detection model. We utilize this organization when learning features to detect botnets.
- We design autoencoders to learn feature spaces from temporally ordered flow data to better detect botnets. To the best of the authors' knowledge, this is the first work to demonstrate how traditional flow log-based features can be improved upon by autoencoding.
- In an attempt to encourage the community to build more robust detectors, we also suggest a way to model noisier host traffic conditions by exploiting the ignored background traffic within this dataset.
- We present an encoder architecture, named filter encoders, which extracts features from a noisy model of host traffic. Detectors trained on these extracted

2018 IEEE Symposium on Security and Privacy Workshops. © 2018, Ruggiero Dargenio. Under license to IEEE. DOI 10.1109/SPW.2018.00017
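The autoencoding step at the heart of these contributions can be sketched as follows. This is a minimal denoising autoencoder in plain NumPy, offered purely as an illustration of the mechanism, not the paper's architecture: the toy dimensions stand in for the 30,000-to-64 reduction, and the Gaussian corruption stands in for the injected background traffic.

```python
import numpy as np

rng = np.random.default_rng(0)

class DenoisingAutoencoder:
    """One-hidden-layer denoising autoencoder (illustrative sketch).

    The encoder maps a high-dimensional flow vector to a compact
    embedding; training corrupts the input and asks the decoder to
    reconstruct the clean vector, encouraging noise-robust features.
    """

    def __init__(self, d_in, d_hid, lr=0.01):
        self.W1 = rng.normal(0.0, 0.1, (d_in, d_hid))   # encoder weights
        self.W2 = rng.normal(0.0, 0.1, (d_hid, d_in))   # decoder weights
        self.lr = lr

    def encode(self, x):
        return np.tanh(x @ self.W1)      # low-dimensional embedding

    def step(self, x, noise=0.1):
        """One gradient step on mean squared reconstruction error."""
        x_noisy = x + rng.normal(0.0, noise, x.shape)   # corrupt input
        h = np.tanh(x_noisy @ self.W1)
        x_hat = h @ self.W2              # reconstruct the *clean* input
        err = x_hat - x
        # backpropagate through the decoder, the tanh, and the encoder
        gW2 = h.T @ err
        gW1 = x_noisy.T @ ((err @ self.W2.T) * (1.0 - h * h))
        self.W2 -= self.lr * gW2 / len(x)
        self.W1 -= self.lr * gW1 / len(x)
        return float((err ** 2).mean())
```

A detector would then be trained on `encode(x)` rather than on the raw windowed vectors, which is the substitution for hand-crafted features that the contributions above describe.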