Malware identification on Portable Executable files using Opcodes Sequence

Alexandre R de Mello, Centro de Convergência Digital e Mecatrônica, Fundação CERTI, Florianópolis, Brazil, aem@certi.org.br
Vitor Gama Lemos, Intelligence Lab, PSafe Cyberlabs, Rio de Janeiro, Brazil, 0000-0003-1290-5192
Flávio G O Barbosa, Computer Vision Department, SENAI Institute of Innovation in Embedded Systems, Florianópolis, Brazil, flavio.barbosa@sc.senai.br
Emílio Simoni, AHT Security, Rio de Janeiro, Brazil, emilio@ahtsecurity.com

Abstract—Malicious software (malware) is a relevant cybersecurity threat, as it can damage target systems, hijack data or credentials, and allow remote code execution. In recent years, researchers and companies have focused on uncovering distinct methods for malware detection to avoid system infection. This paper assesses a method that employs opcode sequence analysis, Graph Theory, and Machine Learning to identify malware in Portable Executable files without the need for execution. A common approach among researchers is to find patterns in the opcode sequences of a file and use an Artificial Intelligence-based strategy to classify the file as malware or benign. In this work, we introduce the OSG (Opcode Sequence Graph), a concept for malware detection based on Opcode Sequence, Graph Theory, and Artificial Intelligence, with two new methods: the OSGT (Opcode Sequence Graph Theory) detector and the OSGNN (Opcode Sequence Graph Neural Network). The OSGT extracts the opcode sequence linearly, creates a graph for each file section, calculates features from a combination of PageRank and node degree of each section, and uses ensemble learning to classify the files. The OSGNN logically extracts the opcode sequence to construct a control flow graph, uses the longest available path to create a graph, and applies a graph neural network to classify the files.
We also propose a novel dataset of 28,000 Windows Portable Executable files, comprising 14,000 recent malware samples and 14,000 trusted files. The experimental results show that both methods outperform the baseline methods and provide up to 99% malware detection. The outcomes of this study show that the OSGT is suitable for real-world application considering its processing time and malware detection capacity, and that the OSGNN achieves state-of-the-art malware detection capacity at an additional computational cost.

Index Terms—Malware detection, Opcode sequence, Graph theory, Opcode graph, Feature Extraction

I. INTRODUCTION

Today, many people use the internet in the most diverse ways possible, such as accessing social networks, performing instant banking transactions, and buying items online. For this reason, criminals also act in the virtual world, using technology to commit virtual crimes and spread malware to access confidential data of people and companies. Malware is software that intentionally executes malicious payloads on victim machines with different goals, such as damaging the targeted system, allowing remote code execution, hijacking data, or stealing confidential data [1]. To prevent unwanted malware execution, and thus avoid becoming a victim of cyberattacks, this work proposes a framework to identify malware in Portable Executable (PE) files. The relevance of the proposed method lies in identifying the malware before it executes, i.e., the PE file does not need to be executed to evaluate whether it is malware. We present two different methods that extract the opcode sequence from files, create a graph that represents the opcode sequence of a file, create features from the graph, and train a binary classifier to identify whether a file is trusted or malware.
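The featurization idea described above (a graph built from an opcode sequence, followed by PageRank and node degree per node) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the sample opcode list, the damping factor, the edge weighting by transition count, and the way the two measures are paired are all hypothetical, and disassembly of a real PE file is omitted.

```python
from collections import defaultdict


def build_opcode_graph(opcodes):
    """Directed graph with an edge a -> b for each pair of consecutive opcodes.

    Returns {source: {target: transition_count}}; every opcode is a key.
    """
    graph = {op: defaultdict(int) for op in opcodes}
    for src, dst in zip(opcodes, opcodes[1:]):
        graph[src][dst] += 1
    return graph


def node_degrees(graph):
    """Unweighted in-degree plus out-degree for every node."""
    degree = {node: len(targets) for node, targets in graph.items()}
    for targets in graph.values():
        for dst in targets:
            degree[dst] += 1
    return degree


def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank on the weighted transition graph.

    The damping factor and iteration count are assumed defaults.
    """
    nodes = list(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            total = sum(graph[src].values())
            for dst, weight in graph[src].items():
                new_rank[dst] += damping * rank[src] * weight / total
        rank = new_rank
    return rank


def section_features(opcodes):
    """Hypothetical per-opcode feature pair: (PageRank, degree)."""
    graph = build_opcode_graph(opcodes)
    degree = node_degrees(graph)
    rank = pagerank(graph)
    return {op: (rank[op], degree[op]) for op in graph}


# Hypothetical opcode sequence standing in for one disassembled section.
sample = ["push", "mov", "call", "mov", "pop", "ret"]
features = section_features(sample)
```

In a full pipeline, feature vectors of this kind would be computed for each file section and passed to a binary classifier; how the features are aggregated and which classifier is used are left open here.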
To evaluate performance in a real-world scenario, we compare multiple evaluation metrics of the proposed methods against two variations of Long Short-Term Memory (LSTM) Recurrent Neural Networks described in this paper.

The main contributions of this paper are:
• A method to convert the opcode sequence of a PE file into a single graph or multiple graphs;
• A featurization process based on an opcode sequence and its graph representation;
• The use of a graph neural network or an ensemble tree-based method as the classifier for malware detection;
• An evaluation of the processing time of different disassembler methods for real-world usage;
• A real-world dataset.

The proposed work is relevant to both academia and real-world usage, as it introduces two different methods for malware detection without the need for file execution, and it explores and compares the usage of a graph representation of opcode sequences for PE files using different disassembling, featurization, and classification methods.

XVI Brazilian Conference on Computational Intelligence (CBIC 2023), Salvador, October 8th to 11th

In the investigation of malware files, there are two techniques