Machine Learning Techniques to Detect Maliciousness of Portable Executable Files

AKRAM M. RADWAN
Department of Engineering and Information Systems
University College of Applied Sciences
Gaza, Palestine
aradwan@ucas.edu.ps

Abstract— In the past few years, malware has become one of the most significant threats to computer security. Malware, or malicious software, is software that attackers use or write to disrupt the operation of a computer, to collect secret or private information, or to access computer systems without authorization. In this paper, we present a machine learning-based approach to classifying a portable executable (PE) file as benign or malware with high accuracy. The proposed approach uses static analysis to extract an integrated feature set, created by combining a few raw features selected from the three main headers of PE files with a set of derived features. Seven supervised learning algorithms are used to classify malware. We compare the performance of each classifier in terms of accuracy, precision, and F-measure. The experimental results indicate that the integrated feature set performs better than the raw feature set on all metrics. Using a 70/30 split, accuracy on the integrated dataset ranges from 91% to 99%, compared with 71% to 97% on the raw dataset. Random Forest outperformed all other classifiers on both datasets (with an accuracy of 99.23%).

Keywords— Portable executable; malware detection; machine learning; static analysis.

I. INTRODUCTION

Malware, or malicious software, refers to programs that are installed on a computer without the owner’s knowledge and are created to harm, interrupt, or damage computers, networks, or files [1]. Malware exists in different forms. Such programs are mainly categorized into the following classes: viruses, worms, Trojan horses, rootkits, spyware, adware, sniffers, keyloggers, spam, and ransomware.
These classes are not mutually exclusive; many malware samples belong to more than one class [1,2]. Traditional antimalware products use blacklisting techniques to detect malware, which are useless when a sample is not on the product’s blacklist [2]. One solution, therefore, is to develop a system that collects data from a file’s headers and then analyzes that data to decide whether the file is malware or benign.

Malware analysis is the process of analyzing the components and behavior of malware. In malware analysis, features can be generated by two types of methods: static malware analysis and dynamic malware analysis [3].

A. Static Malware Analysis

Static analysis is a process in which a malware binary is analyzed and its features are extracted without executing the code [4]. One approach, known as signature-based detection, identifies the signature of the binary file, which serves as a unique identifier for that binary. By contrast, non-signature-based detection applies a set of rules to extract features and then trains a classifier model on those features to detect malicious files or programs.

B. Dynamic Malware Analysis

Dynamic analysis aims to analyze the behavior of malicious code at runtime. This process is intended to remove the infection or block it from spreading to other systems [3,5]. Dynamic analysis approaches execute a suspicious sample in a controlled environment and determine whether it is indeed malware.

Some malware detection systems employ only static or only dynamic methodologies, but others use both [6]. Machine learning algorithms are used to design static analysis techniques for malware detection. These techniques try to extract a feature set with which to build a robust malware detection system.
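As a rough illustration of the static approach, the following Python sketch reads a few raw fields from the MS-DOS header, file header, and optional header of a PE image without executing it. The function names (`extract_raw_features`, `to_feature_vector`, `build_sample_image`), the choice of fields, and the synthetic sample bytes are assumptions made for illustration only; they are not the feature set or tooling used in this study.

```python
import struct

DOS_MAGIC = b"MZ"
PE_SIGNATURE = b"PE\x00\x00"
FILE_HEADER = struct.Struct("<HHIIIHH")        # 20-byte COFF file header
OPTIONAL_PREFIX = struct.Struct("<HBBIIIII")   # first 24 bytes of the optional header


def extract_raw_features(data: bytes) -> dict:
    """Read selected header fields from raw PE bytes (illustrative field choice)."""
    if len(data) < 64 or data[:2] != DOS_MAGIC:
        raise ValueError("missing MS-DOS header")
    # e_lfanew (offset 0x3C) holds the file offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != PE_SIGNATURE:
        raise ValueError("PE signature not found")

    fh_off = e_lfanew + 4
    if len(data) < fh_off + FILE_HEADER.size:
        raise ValueError("truncated file header")
    (machine, n_sections, timestamp, _sym_ptr, _n_syms,
     opt_size, characteristics) = FILE_HEADER.unpack_from(data, fh_off)
    features = {
        "e_lfanew": e_lfanew,
        "Machine": machine,
        "NumberOfSections": n_sections,
        "TimeDateStamp": timestamp,
        "SizeOfOptionalHeader": opt_size,
        "Characteristics": characteristics,
    }

    oh_off = fh_off + FILE_HEADER.size
    if opt_size >= OPTIONAL_PREFIX.size and len(data) >= oh_off + OPTIONAL_PREFIX.size:
        (magic, major_linker, minor_linker, size_code, size_init,
         size_uninit, entry_point, _base_code) = OPTIONAL_PREFIX.unpack_from(data, oh_off)
        features.update({
            "Magic": magic,
            "MajorLinkerVersion": major_linker,
            "MinorLinkerVersion": minor_linker,
            "SizeOfCode": size_code,
            "SizeOfInitializedData": size_init,
            "SizeOfUninitializedData": size_uninit,
            "AddressOfEntryPoint": entry_point,
        })
    return features


def to_feature_vector(features: dict) -> list:
    """Order the fields by name so every file yields the same column layout."""
    return [features[name] for name in sorted(features)]


def build_sample_image() -> bytes:
    """Synthetic header-only bytes for demonstration; not a loadable executable."""
    dos = bytearray(64)
    dos[:2] = DOS_MAGIC
    struct.pack_into("<I", dos, 0x3C, 64)  # PE signature follows the DOS header
    file_header = FILE_HEADER.pack(0x14C, 3, 0x5C000000, 0, 0, 224, 0x0102)
    optional = OPTIONAL_PREFIX.pack(0x10B, 14, 0, 4096, 8192, 0, 0x1000, 0x1000)
    return bytes(dos) + PE_SIGNATURE + file_header + optional.ljust(224, b"\x00")


if __name__ == "__main__":
    sample_features = extract_raw_features(build_sample_image())
    print(sample_features)
    print(to_feature_vector(sample_features))
```

Sorting the field names gives a fixed column order, so the vectors from different files can be stacked into a single table and passed to any supervised classifier.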
After feature extraction, each file can be represented as a feature vector, which the classification method can use to improve the performance of malware detection systems.

The Portable Executable (PE) format is a file format for executable files used in Windows operating systems. Executable file extensions include EXE, DLL, SYS, APP, SCR, BAT, etc. [7]. The PE format is a data structure that encapsulates the information necessary for the Windows OS loader to manage the wrapped executable code [8]. A PE file contains three types of headers, namely the MS-DOS header, the File header, and the Optional header [7]. We chose PE files because they are the most used file format, owing to the wide use of Windows OS, and because around 48% of files submitted to VirusTotal¹ are PE files.

This study presents a machine learning (ML) based approach to classifying a PE file as benign or malware with high accuracy. The approach uses a static analysis technique to extract an integrated feature set, which has a higher discrimination capability between the two class labels.

¹ https://virustotal.com/en/statistics

2019 International Conference on Promising Electronic Technologies (ICPET)
978-1-7281-2337-0/19/$31.00 ©2019 IEEE
DOI 10.1109/ICPET.2019.00023