Tech Science Press
Computers, Materials & Continua
DOI:10.32604/cmc.2021.018260
Article

Toward Robust Classifiers for PDF Malware Detection

Marwan Albahar*, Mohammed Thanoon, Monaj Alzilai, Alaa Alrehily, Munirah Alfaar, Maimoona Algamdi and Norah Alassaf

College of Computers in Al-Leith, Umm Al Qura University, Makkah, Saudi Arabia
*Corresponding Author: Marwan Albahar. Email: mabahar@uqu.edu.sa
Received: 03 March 2021; Accepted: 19 April 2021

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract: Malicious Portable Document Format (PDF) files represent one of the largest threats in the computer security space. Significant research has been done on detection using handwritten signatures and on machine learning-based detection that relies on manual feature extraction. These approaches are time consuming and require substantial prior knowledge, and the feature list must be updated for each newly discovered vulnerability. In this study, we propose two models for PDF malware detection. The first model is a convolutional neural network (CNN) integrated into a standard deviation based regularization model to detect malicious PDF documents. The second model is a support vector machine (SVM) based ensemble model with three different kernels. The two models were trained and tested on two different datasets. The experimental results show that the accuracy of both models is approximately 100%, and the robustness against evasive samples is excellent. Further, the robustness of the models was evaluated with malicious PDF documents generated using Mimicus. Both models can distinguish the different vulnerabilities exploited in malicious files and achieve excellent performance in terms of generalization ability, accuracy, and robustness.

Keywords: Malicious PDF classification; robustness; guiding principles; convolutional neural network; new regularization

1 Introduction

Malware remains a hot topic in the field of computer security. It is employed by criminals, industries, and even government actors for espionage, theft, and other malicious endeavors. With several million new malware strains emerging daily, identifying them before they harm computers or networks is one of the most pressing challenges of cyber security. Over the last 20 years, hackers have continuously discovered new forms of attacks, giving rise to numerous malware types. Some hackers have utilized macros within Microsoft Office documents, while others have exploited browser vulnerabilities through code in JavaScript files. This range of malware calls for novel automated technology to address these attacks.

A popular document file format is the Portable Document Format (PDF). Largely unnoticed by users, PDF has become a significant attack vector (AV) for malware operators. Each year, many vulnerabilities are revealed in Adobe Reader, the most widely used software for reading PDF