Discovering Malware with Time Series Shapelets

Om P. Patri
University of Southern California
Los Angeles, CA 90089
patri@usc.edu

Michael T. Wojnowicz
Cylance Inc.
Irvine, CA 92612
mwojnowicz@cylance.com

Matt Wolff
Cylance Inc.
Irvine, CA 92612
mwolff@cylance.com

Abstract

Malicious software (‘malware’) detection systems are usually signature-based and cannot stop attacks by malicious files they have never encountered. To stop these attacks, we need statistical learning approaches to identify root patterns behind execution of malware. We propose a machine learning approach for detection of malware from portable executable (PE) files. We create an ‘entropy time series’ representation of the content of each file, and then apply a unique time series classification method (called ‘shapelets’) for identifying malware. The shapelet-based approach picks up local discriminative features from the entropy signals. Our approach is file format agnostic, can deal with varying lengths in input instances, and provides fast classification. We evaluate our method on an industrial dataset containing thousands of executable files, and comparison with state-of-the-art methods illustrates the performance of our approach. This work is the first to use time series shapelets for malware detection and information security applications.

1. Introduction

The evolving volume, variety and intensity of vulnerabilities due to malicious software (‘malware’) call for smarter malware detection techniques [32]. Most existing antivirus solutions rely on signature-based detection, which requires past exposure to the malware being detected. Such systems fail at detection of new malware which was previously unseen (‘zero-day attacks’) [7]. A new zero-day was discovered each week on average in 2015 [2]. Effective statistical learning approaches can automatically find root patterns behind execution of a malicious file to build a model that can accurately and quickly classify new malware [3,4,29-32,35].

We propose a new approach for malware detection from Microsoft portable executable (PE) files [26] using an advanced time series classification approach, which can pick up local discriminative features from data. Existing approaches to classification that use entropy analysis [16] or wavelet energy spectrum analysis [36] often only use global properties of the input signal.

Time series shapelets. Our approach is based on time series shapelets [39]. A shapelet is a subsequence from the input (training) time series, which can discriminate between classes in the data. Intuitively, shapelet-based approaches focus on finding local discriminative features within the data, and once these are found, the rest of the data is discarded. Shapelets have been used for a wide range of time series data mining tasks [11,15,21,22,23,27].

Entropy representation. Modern malware may contain sophisticated malicious code hidden within compressed or encrypted files [5,16]. For instance, parasitic infections and injected shellcode (see footnote a) often rely on packed (compressed) code segments with concealed malicious code. Entropy analysis [5,34,37] has been used to detect such attacks. As observed by Lyda et al. [16], sections of code with packing or encryption tend to have higher entropy.

We choose to represent each PE file by an ‘entropy time series.’ We get byte-level content from each file, make non-overlapping chunks of the file content, and compute entropy of each chunk. This gives us the entropy time series (ETS) representation (see footnote b): one time series for each file. Given labeled training data (both benign and malware files), our aim is to classify a test file as safe or malicious. Thus, we frame the malware detection task as a time series classification task.

Challenges. There are multiple challenges involved with malware detection from complex executable files found in the wild [4,13,29,31]. The input data varies in structure, nature, patterns and lengths of time series.
Our data has ETS lengths ranging from a couple of data points to more than a hundred thousand data points. Some subsections of ETS show multiple rapid changes in entropy while others are largely flat or zero.

Footnote a: Modeling polymorphic shellcode is an intractable problem [33].
Footnote b: ‘Time series’ is not meant in the literal sense of time, but in the statistical sense of sequential data that is not i.i.d. (independent and identically distributed).

Proceedings of the 50th Hawaii International Conference on System Sciences | 2017
URI: http://hdl.handle.net/10125/41898
ISBN: 978-0-9981331-0-2
CC-BY-NC-ND
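
The entropy time series (ETS) construction described under "Entropy representation" (byte-level content, non-overlapping chunks, entropy of each chunk) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 256-byte chunk size, the use of base-2 Shannon entropy over byte values, the treatment of a shorter trailing chunk as its own data point, and the function names `shannon_entropy` and `entropy_time_series` are all choices made here for illustration.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of the byte values in `chunk`, in bits per byte (0.0 to 8.0).

    Assumption: base-2 entropy over the empirical byte distribution.
    """
    if not chunk:
        return 0.0
    total = len(chunk)
    counts = Counter(chunk)  # occurrences of each byte value 0..255
    # Sum of p * log2(1/p) over the byte values that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def entropy_time_series(data: bytes, chunk_size: int = 256) -> List[float]:
    """Split `data` into non-overlapping chunks and return one entropy value per chunk.

    Assumptions: chunk_size=256 is illustrative only, and a shorter final
    chunk is kept as its own data point rather than padded or dropped.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        shannon_entropy(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]


# Synthetic example: a constant chunk, a chunk holding every byte value
# exactly once, and a short trailing chunk with two equally likely bytes.
sample = bytes(256) + bytes(range(256)) + b"abab"
print(entropy_time_series(sample))  # → [0.0, 8.0, 1.0]
```

Because the series has one value per chunk, its length grows with file size, which is consistent with the wide range of ETS lengths noted above. For a real file, the bytes could be obtained with `open(path, "rb").read()`, where `path` is a placeholder.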