Behavior-based Proactive Detection of Unknown Malicious Codes

Jianguo Ding∗‡, Jian Jin, Pascal Bouvry, Yongtao Hu§ and Haibing Guan

Faculty of Science, Technology and Communication (FSTC), University of Luxembourg, L-1359 Luxembourg
Email: Jianguo.Ding@ieee.org
School of Information Science and Technology, East China Normal University, Shanghai, 200062, P. R. China
Software Engineering Institute, East China Normal University, Shanghai, 200062, P. R. China
§ The Third Research Institute of the Ministry of Public Security, P. R. China
School of Information Security Engineering, Shanghai Jiao Tong University, Shanghai 200030, P. R. China

Abstract: With the rising popularity of the Internet, the resulting increase in the number of available vulnerable machines, and the elevated sophistication of malicious code itself, the detection and prevention of unknown malicious codes face great challenges. Traditional anti-virus scanners employ static features to detect malicious executable codes and struggle to detect unknown malicious codes effectively. We propose a behavior-based dynamic heuristic analysis approach for proactive detection of unknown malicious codes. The behavior of malicious codes is identified from system calls observed through virtual emulation and from changes in system resources. A statistical detection model and a mixture of experts (MoE) model are designed to analyze the behavior of malicious codes. The experimental results demonstrate that behavior-based proactive detection is efficient in detecting unknown malicious executable codes.

I. INTRODUCTION

Malicious code (or malware) is defined as any program (including macros and scripts) that is specifically coded to cause an unexpected (and usually unwanted) event on a user’s PC or a server. Typical examples include viruses, Trojan horses, worms, backdoors, spyware, and adware.
One reason for the prevalence of malicious code on today’s networks is the rising popularity of the Internet and the resulting increase in the number of available vulnerable machines because of security-unaware users. Another reason is the elevated sophistication of the malicious code itself [3].

One issue that has been raised concerns the behavior of malicious codes and their sources. Surprisingly, the basic functionality of malware has not changed much. The samples observed today either steal sensitive information (key loggers, password thieves, bank Trojans), send spam mails, or can be used to launch denial of service attacks. The real development in malicious codes lies in the obfuscation techniques that make them hard to detect and identify. For example, a polymorphic virus changes form each time it infects a new victim. A metamorphic virus changes the structure of the virus body as well as the decryption engine, making it impossible to get a signature match [9]. Meanwhile, mapping out dark (honeypot) address spaces is an emerging threat. As a result, there is a need to develop techniques that can accurately capture emerging threats, since good intelligence is a prerequisite for subsequent mitigation efforts.

Traditional signature-based anti-virus scanners take segments of file content as the technical component. The analytical component is just a simple comparison between the segments and the signature-pattern database. This method yields a very low false-positive fraction, close to zero, but it performs poorly when facing previously unknown malicious executables or variants of existing ones.

Current anti-virus scanners employ static heuristics to alleviate this problem. Instead of looking for the specific signature of a virus, they look for virus behavior. Each signature is a generic code sequence that represents a behavior feature, and a complex comparison is involved in the analytical component.
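The comparison performed by a signature-based scanner can be illustrated with a minimal sketch. This is not the scanner of any real product: the signature names, byte values, and the use of don't-care bytes as one way of making a code sequence "generic" are all assumptions chosen for illustration. The sketch shows an exact byte comparison against a small signature database, a generic pattern with don't-care positions, and a simple XOR re-encoding that stands in for the kind of body change a polymorphic sample applies.

```python
import re

# Hypothetical signature database (illustrative byte values, not real malware).
EXACT_SIGNATURES = {
    "demo.exact": bytes.fromhex("deadbeef90904142"),
}

# Hypothetical generic signature: None marks a don't-care byte position.
GENERIC_SIGNATURES = {
    "demo.generic": [0xDE, 0xAD, None, None, 0x90, 0x90],
}


def scan_exact(content):
    """Return names of signatures that occur verbatim in the file content."""
    return sorted(name for name, sig in EXACT_SIGNATURES.items() if sig in content)


def generic_to_regex(pattern):
    """Compile a byte pattern with don't-care positions into a bytes regex."""
    parts = [b"." if b is None else re.escape(bytes([b])) for b in pattern]
    # DOTALL lets '.' match every byte value, including 0x0A.
    return re.compile(b"".join(parts), re.DOTALL)


def scan_generic(content):
    """Return names of generic signatures that match somewhere in the content."""
    return sorted(
        name
        for name, pattern in GENERIC_SIGNATURES.items()
        if generic_to_regex(pattern).search(content)
    )


def xor_encode(body, key):
    """Re-encode a body with a one-byte XOR key (a stand-in for polymorphism)."""
    return bytes(b ^ key for b in body)


if __name__ == "__main__":
    sample = b"\x00\x01" + bytes.fromhex("deadbeef90904142") + b"\x02"
    variant = sample.replace(bytes.fromhex("beef"), bytes.fromhex("cafe"))
    encoded = xor_encode(sample, 0x5A)
    for label, data in [("sample", sample), ("variant", variant), ("encoded", encoded)]:
        print(label, scan_exact(data), scan_generic(data))
```

In this toy setting the exact signature only matches the unmodified sample, the generic pattern also matches the two-byte variant, and neither matches the re-encoded body, since both comparisons work on file content alone.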
However, the static heuristic method also derives its data from the file content as the technical component and can be defeated by obfuscation techniques such as polymorphism and metamorphism. Although wildcards have been added to the code sequences to address the obfuscation problem, a high false-positive fraction results. In addition, this method depends on auxiliary techniques such as unpacking, decryption and disassembly.

This paper uses a dynamic heuristic method to analyze the running behaviors of malicious codes and seeks to establish an automatic mechanism to assist in classifying and identifying unknown malicious codes. The main contributions are summarized as follows:

1. The characteristic behaviors of malicious codes are identified based on behavior features defined by the corresponding Win32 API calls and certain of their parameters.
2. An automatic executable behavior tracing system is implemented to dynamically capture the behavior features we defined.
3. Two approaches are presented for behavior analysis and for establishing classification strategies for the proactive detection of malicious codes. Experimental results demonstrate that the proactive strategies are efficient in detecting previously unknown malicious executables.

2009 Fourth International Conference on Internet Monitoring and Protection. 978-0-7695-3612-5/09 $25.00 © 2009 IEEE. DOI 10.1109/ICIMP.2009.20

The rest of this paper is organized as follows: Section 2 describes related work on malicious executable detection based on malicious behavior. Section 3 presents the malicious behavior feature definition. Section 4 gives the details of the dynamic behavior analysis for malicious codes. Section 5