Machine Learning Techniques to Detect Maliciousness of
Portable Executable Files
AKRAM M. RADWAN
Department of Engineering and Information Systems
University College of Applied Sciences
Gaza, Palestine
aradwan@ucas.edu.ps
Abstract— In the past few years, malware has become one of
the most significant threats to computer security. Malware, or
malicious software, is software that attackers write or use to
disrupt the operation of a computer, to collect secret or private
information, or to gain unauthorized access to computer systems.
In this paper, we present a machine learning-based approach to
classifying a portable executable (PE) file as benign or malware
with high accuracy. The proposed approach uses static analysis to
extract an integrated feature set, created by combining a few raw
features selected from the three main headers of PE files with a
set of derived features. Seven supervised learning algorithms are
used to classify the files, and we compare the performance of each
classifier in terms of accuracy, precision, and F-measure.
The experimental results indicate that the integrated feature set
performs better than the raw feature set on all metrics. With a
70/30 split, accuracy on the integrated dataset ranges from 91% to
99%, compared with 71% to 97% on the raw dataset. Random Forest
outperforms all other classifiers on both datasets (with an
accuracy of 99.23%).
Keywords— Portable executable; malware detection; machine
learning; static analysis.
I. INTRODUCTION
Malware, or malicious software, refers to programs that are
installed on a computer without the owner's knowledge and are
designed to harm, disrupt, or damage computers, networks, or files [1].
Malware exists in different forms and is mainly categorized
into the following classes: viruses, worms, Trojan horses,
rootkits, spyware, adware, sniffers, keyloggers, spam, and
ransomware. These classes are not mutually exclusive; many
malware samples belong to more than one class [1,2].
Traditional antimalware tools use blacklisting techniques to
detect malware, which fail when a malware sample is not in the
tool's list [2]. A solution is therefore to develop a system that
detects malware by collecting data from a file's headers and
analyzing that data to decide whether the file is malware or
benign.
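The limitation of blacklisting can be illustrated with a minimal sketch. It assumes the blacklist is a set of SHA-256 file hashes, which is only one possible form of blacklist; the function names `sha256_of` and `is_blacklisted` and the sample bytes are hypothetical and chosen for illustration only.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hexadecimal SHA-256 digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def is_blacklisted(data: bytes, blacklist: set) -> bool:
    """Flag a file only if its exact hash is already in the blacklist."""
    return sha256_of(data) in blacklist


# Illustrative bytes standing in for a known malicious file (not real malware).
known_sample = b"MZ placeholder bytes of a known sample"
blacklist = {sha256_of(known_sample)}

# A variant that differs by a single appended byte has a different hash,
# so a lookup in the blacklist no longer matches it.
variant = known_sample + b"\x00"
known_hit = is_blacklisted(known_sample, blacklist)
variant_hit = is_blacklisted(variant, blacklist)
```

In this sketch the known sample is flagged while the slightly modified variant is not, which is the gap that a header-based learning approach is meant to close.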
Malware analysis is the process of analyzing the components
and behavior of malware. In malware analysis, features can be
generated by two types of methods: static malware analysis or
dynamic malware analysis [3].
A. Static Malware Analysis
Static analysis is the process of analyzing a malware binary
and extracting its features without actually executing the code
[4]. In the signature-based detection technique, the analysis
identifies the signature of the binary file, which serves as a
unique identifier for the binary. A non-signature-based technique
instead detects a malicious file or program by applying a set of
rules to extract features and then training a classifier model on
those features.
B. Dynamic Malware Analysis
Dynamic analysis aims to analyze the behavior of malicious
code at runtime. This process is intended to remove the
infection or block its spread to other systems [3,5]. Dynamic
analysis approaches execute a suspicious sample in a controlled
environment and determine whether it is indeed malware.
Some malware detection systems employ only static or only
dynamic methodologies, while others use both [6]. Machine
learning algorithms are used to design static analysis
techniques for malware detection. These techniques extract a
feature set in order to build a robust malware detection
system. After feature extraction, each file can be
represented as a feature vector that can be used by the
classification method to improve the performance of malware
detection systems.
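As a minimal sketch of this representation, the code below maps a dictionary of extracted features to a fixed-order feature vector and scores a set of predictions with accuracy, precision, and F-measure. The function names, the example feature names, and the choice to fill missing features with 0.0 are assumptions made for illustration; they are not taken from the proposed approach.

```python
from typing import Dict, List, Sequence


def to_feature_vector(features: Dict[str, float],
                      names: Sequence[str]) -> List[float]:
    """Order features by a fixed list of names.

    Features absent from the dictionary default to 0.0 (an assumption).
    """
    return [float(features.get(name, 0.0)) for name in names]


def binary_metrics(y_true: Sequence[int],
                   y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F-measure, with malware (1) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical feature names; any fixed ordering works as long as it is
# the same for every file in the dataset.
names = ["NumberOfSections", "SizeOfCode"]
vector = to_feature_vector({"NumberOfSections": 3}, names)
scores = binary_metrics([1, 1, 0, 0], [1, 0, 0, 1])
```

Keeping the feature order fixed matters because a classifier interprets each vector position as the same attribute across all files.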
The Portable Executable (PE) format is the file format for
executable files used in Windows operating systems.
Executable file extensions include EXE, DLL, SYS, APP,
SCR, BAT, etc. [7]. The PE format is a data structure that
encapsulates the information necessary for the Windows OS
loader to manage the wrapped executable code [8]. A PE file
contains three types of headers, namely the MS-DOS header, the
File header, and the Optional header [7]. We chose PE files
because they are the most widely used file format, owing to the
wide use of Windows OS, and because around 48% of the files
submitted to VirusTotal¹ are PE files.
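To make the three headers concrete, the following sketch reads a handful of raw fields from each of them using only the Python standard library. The field offsets follow the published PE layout, but the function name `parse_pe_headers` and the particular fields chosen are assumptions for illustration; they are not necessarily the raw features selected in this study.

```python
import struct


def parse_pe_headers(data: bytes) -> dict:
    """Read a few raw fields from the MS-DOS, File and Optional headers."""
    # MS-DOS header: 'MZ' magic at offset 0; e_lfanew at offset 0x3C holds
    # the file offset of the PE signature.
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MS-DOS header")
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")

    # File (COFF) header: 20 bytes immediately after the 4-byte signature.
    (machine, n_sections, timestamp, _symtab, _n_symbols,
     opt_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, e_lfanew + 4)
    features = {
        "e_lfanew": e_lfanew,
        "Machine": machine,
        "NumberOfSections": n_sections,
        "TimeDateStamp": timestamp,
        "SizeOfOptionalHeader": opt_size,
        "Characteristics": characteristics,
    }

    # Optional header: starts right after the File header. Only its first
    # 20 bytes are read here, which are laid out identically in PE32 and PE32+.
    if opt_size >= 20:
        (magic, major_linker, minor_linker, size_code, size_init,
         size_uninit, entry_point) = struct.unpack_from(
            "<HBBIIII", data, e_lfanew + 24)
        features.update({
            "Magic": magic,
            "MajorLinkerVersion": major_linker,
            "MinorLinkerVersion": minor_linker,
            "SizeOfCode": size_code,
            "SizeOfInitializedData": size_init,
            "SizeOfUninitializedData": size_uninit,
            "AddressOfEntryPoint": entry_point,
        })
    return features
```

For a file on disk, the bytes can be obtained by opening the file in binary mode and passing its contents to the function; the resulting dictionary can then be turned into a feature vector.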
This study presents a machine learning (ML) based
approach to classifying a PE file as benign or malware with high
accuracy. The approach uses static analysis to extract an
integrated feature set, which has a higher discrimination
capability between the two class labels than the raw feature set.
¹ https://virustotal.com/en/statistics
2019 International Conference on Promising Electronic Technologies (ICPET)
978-1-7281-2337-0/19/$31.00 ©2019 IEEE
DOI 10.1109/ICPET.2019.00023