Malware Automatic Analysis

César Augusto Borges de Andrade, Claudio Gomes de Mello, Julio Cesar Duarte
Computer Engineering Department
Military Engineering Institute (IME)
Rio de Janeiro-RJ, Brazil
{borges, cgmello, duarte}@ime.eb.br

Abstract—Malicious code analysis identifies the behavioral characteristics of malware: how it acts in the operating system, which obfuscation techniques it uses, which execution flows lead to its primary planned behavior, and whether it performs network operations, downloads files, captures user and system information, or accesses records, among other activities. The goal is to learn how malware works, to create ways of identifying new malicious software with similar behavior, and to devise defenses. Manual analysis for signature generation becomes impractical, since it requires too much time compared with the speed at which new malware is created and disseminated. Therefore, this paper proposes the use of sandbox and machine learning techniques to automate the identification of malicious software in this context. Besides presenting a different and faster approach to malware detection, this paper achieves an accuracy rate of over 90% in the malware identification task.

Keywords—malware; sandbox; machine learning.

I. INTRODUCTION

With the growth of the Internet and computer software, the proliferation of malware (malicious software), that is, programs designed to perform harmful actions on a computer [1], has become a major problem. These malicious programs can steal personal and business information, as Flame does; perform denial of service; execute banking transactions; and cause industrial sabotage, as Stuxnet did. The term malware is commonly used to refer to any malicious software. According to its behavior, malware can be classified as viruses, worms, spyware, trojans, and bots, among others [2].
Worms, the most common type, can replicate without any human intervention by exploiting vulnerabilities in existing software, which allows them to spread easily. Antivirus software, the main anti-malware product, cannot keep up with the creation and dissemination of most malware, because new variants with new evasive capabilities are created all the time, making analysis techniques inefficient. The annual number of unique malware samples has increased over the last 10 years, passing the 30 million mark in 2012 [3]. Manual analysis for signature generation therefore becomes impractical, since it requires too much time compared with the speed at which new malware spreads and establishes itself. To secure a computing environment, malware must be detected efficiently. In this scenario, automatic analysis has proven to be the most efficient option. This analysis relies on test environments with automatic restore mechanisms, such as virtual machines.

As seen in the model proposed in [4], one of the great problems of automatic analysis is that the interpretation of the large reports generated by sandboxes (restricted and controlled environments for executing artifacts, generally suspicious software) is left to the user. In other words, the system cannot be said to have analyzed the sample; it has only produced an execution report recording the activities carried out in a given period.

This paper proposes the use of sandboxes and machine learning to automate the identification of malicious code. The main scheme of the proposal is shown in Fig. 1, where the Customized Behavioral Analysis no longer requires human intervention and is performed by machine learning techniques.

Fig. 1. Malware analysis flowchart.
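To illustrate the idea of replacing manual report interpretation with machine learning, the following is a minimal Python sketch, not the method evaluated in this paper. It assumes that a sandbox report can be reduced to a list of behavior events. The event names, the labeled samples, and the choice of a multinomial naive Bayes classifier are all hypothetical and were chosen only for brevity.

```python
import math
from collections import Counter

# Hypothetical labeled sandbox reports: each report is reduced to a list of
# behavior events. Event names and labels are invented for illustration.
TRAINING_REPORTS = [
    (["file_write", "registry_set", "network_connect", "process_inject"], "malware"),
    (["network_connect", "file_download", "registry_set", "registry_set"], "malware"),
    (["file_read", "file_write", "window_create"], "benign"),
    (["file_read", "window_create", "network_connect"], "benign"),
]


def train(reports):
    """Collect the counts needed by a multinomial naive Bayes classifier."""
    class_docs = Counter()   # number of reports per label
    event_counts = {}        # label -> Counter of events seen under that label
    vocab = set()            # every event seen during training
    for events, label in reports:
        class_docs[label] += 1
        event_counts.setdefault(label, Counter()).update(events)
        vocab.update(events)
    return class_docs, event_counts, vocab


def classify(events, model):
    """Return the label with the highest log-posterior for a new report."""
    class_docs, event_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        counts = event_counts[label]
        # Laplace (add-one) smoothing over the training vocabulary.
        denom = sum(counts.values()) + len(vocab)
        score = math.log(n_docs / total_docs)
        for event in events:
            if event in vocab:  # events never seen in training are ignored
                score += math.log((counts[event] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_REPORTS)
print(classify(["registry_set", "process_inject", "network_connect"], model))  # malware
print(classify(["file_read", "window_create"], model))                         # benign
```

With real data, the event lists would come from parsed sandbox reports, and the classifier would be trained and evaluated on far more samples than the four toy reports used here.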
This paper is organized as follows. Section 2 discusses related work. Section 3 covers concepts related to malicious code analysis and anti-analysis techniques. Section 4 presents the characteristics of the experiment and the methodology used. Section 5 presents the evaluation results. Section 6 offers concluding remarks, and the following section presents suggestions for future work.