Electronics 2022, 11, 3142. https://doi.org/10.3390/electronics11193142 www.mdpi.com/journal/electronics
Article
PDF Malware Detection Based on Optimizable Decision Trees
Qasem Abu Al-Haija 1,*, Ammar Odeh 2 and Hazem Qattous 3
1 Department of Cybersecurity, Princess Sumaya University for Technology (PSUT), Amman 11941, Jordan
2 Department of Computer Science, Princess Sumaya University for Technology (PSUT), Amman 11941, Jordan
3 Department of Software Engineering, Princess Sumaya University for Technology (PSUT), Amman 11941, Jordan
* Correspondence: q.abualhaija@psut.edu.jo
Abstract: Portable document format (PDF) files are among the most widely used file types. This
has incentivized attackers to turn these normally innocuous PDF files into infection vectors. This
is usually achieved by hiding embedded malicious code in the victims’ PDF documents to infect
their machines. The resulting PDF malware calls for techniques that distinguish benign files from
malicious files. Research studies have indicated that machine learning methods provide efficient
detection techniques against such malware. In this paper, we present a new detection system that
analyzes PDF documents to distinguish benign PDF files from malicious ones. The proposed system
uses an AdaBoost decision tree with optimized hyperparameters, trained and evaluated on a modern,
comprehensive dataset, viz. Evasive-PDFMal2022. The experimental evaluation demonstrates a
lightweight and accurate PDF malware detection system, achieving a 98.84% prediction accuracy
with a short prediction time of 2.174 μs. The proposed model thus outperforms other
state-of-the-art models in the same study area. Hence, the proposed system can be used effectively
to uncover PDF malware with high detection performance and low detection overhead.
Keywords: portable document format (PDF); machine learning; detection; optimizable decision
tree; AdaBoost; PDF malware; evasion attacks; cybersecurity
1. Introduction
A piece of harmful code that has the potential to damage a computer or network is
referred to as malware. Recent years have seen a significant increase in malware, while
conventional signature-based detection technologies have become ineffective and impractical.
Malware developers and cybercriminals have adopted code obfuscation techniques,
which reduce the efficiency of defensive mechanisms against malware [1,2].
Malware classification and identification remain a challenge in this decade. This is
largely because advanced malware is more sophisticated and can remain hidden or change
its code or behavior to act more intelligently. As a result, outdated detection and
classification methods are less useful today, and the focus has shifted to machine
learning for better malware identification and categorization [3,4].
Malicious PDF files are a common attack vector [5]. Forensic research
is hampered by the difficulty of separating harmful PDFs from large PDF files. Machine
learning has advanced to the point where it can be used to detect malicious PDF
documents, either to assist forensic investigators or to shield a system from attack [6].
However, adversarial techniques have been developed against malicious document classifiers:
carefully crafted adversarial examples based on precise manipulation can be
misclassified. This poses a danger to numerous machine-learning-based detectors [7,8].
Various analysis and detection methods have been proposed for particular attacks, but the
threat posed by adversarial attacks has not yet been fully overcome. Figure 1 depicts a PDF
document’s header, body, cross-reference table (xref), and trailer components [9].
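To make the four components named above concrete, the following minimal Python sketch scans raw bytes for markers of a header, body objects, a cross-reference table, and a trailer. It is only an illustration under stated assumptions and is not the detection system proposed in this paper: the function name `pdf_structure`, the `SAMPLE` bytes, and the regular expressions are hypothetical, the byte offsets in the toy sample are not meant to be accurate, and files that use cross-reference streams instead of a classic `xref` table would not be recognized by this check.

```python
import re


def pdf_structure(data: bytes) -> dict:
    """Report which top-level PDF components appear in raw bytes (rough heuristic)."""
    # Header: "%PDF-<major>.<minor>", searched near the start of the file.
    header = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return {
        "version": header.group(1).decode() if header else None,
        # Body: count indirect object openings such as "1 0 obj".
        "objects": len(re.findall(rb"\b\d+\s+\d+\s+obj\b", data)),
        # Classic cross-reference table: the keyword "xref" on its own line.
        "has_xref": re.search(rb"^xref\b", data, re.M) is not None,
        # Trailer dictionary: the keyword "trailer" on its own line.
        "has_trailer": re.search(rb"^trailer\b", data, re.M) is not None,
        # End-of-file marker that closes the trailer section.
        "has_eof": data.rstrip().endswith(b"%%EOF"),
    }


# Toy document for illustration only; offsets are not meant to be accurate.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n45\n%%EOF\n"
)

if __name__ == "__main__":
    print(pdf_structure(SAMPLE))
```

A structural check of this kind only confirms that the expected sections are present; it says nothing about whether the content of the body is benign or malicious.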
Citation: Al-Haija, Q.A.; Odeh, A.; Qattous, H. PDF Malware Detection Based on Optimizable Decision Trees. Electronics 2022, 11, 3142. https://doi.org/10.3390/electronics11193142
Academic Editors: Jungong Han and Ahmed Abu-Siada
Received: 6 September 2022
Accepted: 28 September 2022
Published: 30 September 2022
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).