Malware identification on Portable Executable files using Opcodes Sequence

Alexandre R de Mello
Centro de Convergência Digital e Mecatrônica, Fundação CERTI
Florianópolis, Brazil
aem@certi.org.br

Vitor Gama Lemos
Intelligence Lab, PSafe Cyberlabs
Rio de Janeiro, Brazil
0000-0003-1290-5192

Flávio G O Barbosa
Computer Vision Department, SENAI Institute of Innovation in Embedded Systems
Florianópolis, Brazil
flavio.barbosa@sc.senai.br

Emílio Simoni
AHT Security, AHT Security
Rio de Janeiro, Brazil
emilio@ahtsecurity.com

Abstract—Malicious software (malware) is a relevant cybersecurity threat, as it can damage target systems, hijack data or credentials, and allow remote code execution. In recent years, researchers and companies have focused on uncovering distinct methods for malware detection to avoid system infection. This paper assesses a method that employs opcode sequence analysis, Graph Theory, and Machine Learning to identify malware in Portable Executable files without the need for execution. An approach used by many researchers is to find patterns in the opcode sequences of a file and use an Artificial Intelligence-based strategy to classify the file as malware or benign. In this work, we introduce the OSG (Opcode Sequence Graph), a concept for malware detection based on Opcode Sequence, Graph Theory, and Artificial Intelligence, with two new methods: the OSGT (Opcode Sequence Graph Theory) detector and the OSGNN (Opcode Sequence Graph Neural Network). The OSGT extracts the opcode sequence linearly, creates a graph for each file section, calculates features from a combination of PageRank and node degree of each section, and uses ensemble learning to classify the files. The OSGNN logically extracts the opcode sequence to construct a control flow graph, uses the longest available path to create a graph, and applies a graph neural network to classify the files.
We also propose a novel dataset composed of 28,000 files: 14,000 up-to-date malware samples and 14,000 trusted Windows Portable Executable files. The experimental results show that both methods outperform the baseline methods and provide up to 99% malware detection. The outcomes of this study show that the OSGT is suitable for real-world application considering its processing time and malware detection capacity, and that the OSGNN achieves state-of-the-art malware detection capacity at an additional computational cost.

Index Terms—Malware detection, Opcode sequence, Graph theory, Opcode graph, Feature Extraction

I. INTRODUCTION

Today, many people use the internet in diverse ways, such as accessing social networks, performing instant banking transactions, and buying items online. For this reason, criminals also act in the virtual world, using technology to commit virtual crimes and spread malware to access the confidential data of people and companies. Malware is software that intentionally executes malicious payloads on victim machines with different goals, such as damaging the targeted system, allowing remote code execution, data hijacking, stealing confidential data, etc. [1]. To prevent unwanted malware execution, and thus avoid falling victim to cyberattacks, this work proposes a framework to identify malware in Portable Executable (PE) files. The relevance of the proposed method lies in identifying the malware before it executes, i.e., the PE file does not need to be executed to evaluate whether it is malware or not. We present two different methods that extract the opcode sequence from a file, build a graph that represents that opcode sequence, derive features from the graph, and train a binary classifier to identify whether a file is trusted or malware.
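The pipeline described above (opcode sequence, then graph, then graph features) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_opcode_graph`, `pagerank`, and `node_degrees` are hypothetical, the opcode list is a toy example, and how PageRank and node degree are combined into the final features is not specified in this section, so that step is left out.

```python
from collections import defaultdict


def build_opcode_graph(opcodes):
    """Build a directed graph with an edge a -> b for each time opcode b
    directly follows opcode a in a linear opcode sequence.
    Edge weights count how often the transition occurs."""
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst in zip(opcodes, opcodes[1:]):
        graph[src][dst] += 1
    for op in opcodes:
        graph[op]  # make sure nodes without outgoing edges are present
    return {node: dict(succ) for node, succ in graph.items()}


def pagerank(graph, damping=0.85, iterations=100, tol=1e-10):
    """Weighted PageRank by power iteration (assumed standard formulation)."""
    nodes = sorted(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {v: 1.0 / n for v in nodes}
    out_weight = {v: sum(graph[v].values()) for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0)
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n
                    for v in nodes}
        for src in nodes:
            for dst, weight in graph[src].items():
                new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


def node_degrees(graph):
    """Total degree (in-edges plus out-edges) of each node, ignoring weights."""
    degree = {v: 0 for v in graph}
    for src, succ in graph.items():
        for dst in succ:
            degree[src] += 1  # outgoing edge
            degree[dst] += 1  # incoming edge
    return degree


# Toy opcode sequence, standing in for one disassembled file section.
opcodes = ["push", "mov", "call", "mov", "pop", "ret"]
graph = build_opcode_graph(opcodes)
rank = pagerank(graph)
degree = node_degrees(graph)
```

In a real setting the per-node values in `rank` and `degree` would be aggregated into a fixed-length feature vector for each file section and passed to a classifier; the aggregation scheme is an open design choice here.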
To evaluate performance in a real-world scenario, we compare multiple evaluation metrics of the proposed methods against two variations of Long Short-Term Memory (LSTM) Recurrent Neural Networks described in this paper. The main contributions of this paper are:
• A method to convert the opcode sequence of a PE file into a single graph or multiple graphs;
• A featurization process based on an opcode sequence and its graph representation;
• The use of a graph neural network or a tree-based ensemble method to classify files for malware detection;
• An evaluation of the processing time of different disassembly methods for real-world usage;
• A real-world dataset.

The proposed work is relevant to both academia and real-world usage, as it introduces two different methods for malware detection without the need for file execution, and it explores and compares the usage of a graph representation of opcode sequences for PE files using different disassembling, featurization, and classification methods.

XVI Brazilian Conference on Computational Intelligence (CBIC 2023), Salvador, October 8th to 11th

In the malware files investigation, there are two techniques