Automatic Malware Categorization Using Cluster Ensemble

Yanfang Ye
Dept. of Computer Science, Xiamen University
Xiamen, 361005, P.R.China
yeyanfang@yahoo.com.cn

Tao Li
School of Computer Science, Florida International University
Miami, FL, 33199, USA
taoli@cs.fiu.edu

Yong Chen
Internet Security R&D Center, Kingsoft Corporation
Zhuhai, 519015, P.R.China
chenyong@kingsoft.com

Qingshan Jiang
Software School, Xiamen University
Xiamen, 361005, P.R.China
qjiang@xmu.edu.cn

ABSTRACT

Malware categorization is an important problem in malware analysis and has recently attracted considerable attention from computer security researchers and the anti-malware industry. Today’s malware samples are created at a rate of millions per day with the development of malware writing techniques. There is thus an urgent need for effective methods for automatic malware categorization. Over the last few years, many clustering techniques have been employed for automatic malware categorization. However, such techniques have had isolated successes with limited effectiveness and efficiency, and few have been applied in the real anti-malware industry. In this paper, resting on the analysis of instruction frequency and function-based instruction sequences, we develop an Automatic Malware Categorization System (AMCS) that automatically groups malware samples into families sharing some common characteristics. AMCS uses a cluster ensemble that aggregates the clustering solutions generated by different base clustering algorithms. We propose a principled cluster ensemble framework for combining individual clustering solutions based on the consensus partition. Domain knowledge in the form of sample-level constraints can be naturally incorporated into the ensemble framework.
In addition, to account for the characteristics of the feature representations, we propose two algorithms to generate base clusterings: a hybrid hierarchical clustering algorithm, which combines the merits of hierarchical clustering and k-medoids algorithms, and a weighted subspace K-medoids algorithm. The categorization results of our AMCS system can be used to generate signatures for malware families that are useful for malware detection. Case studies on a large, real daily malware collection from the Kingsoft Anti-Virus Lab demonstrate the effectiveness and efficiency of our AMCS system.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning; D.4.6 [Operating Systems]: Security and Protection - Invasive software

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD’10, July 25–28, 2010, Washington, DC, USA.
Copyright 2010 ACM 978-1-4503-0055-1/10/07 ...$10.00.

General Terms
Algorithms, Experimentation, Security

Keywords
malware categorization, cluster ensemble, signature

1. INTRODUCTION

1.1 Malware Categorization

Due to the damage it causes to computer security, malware (such as viruses, worms, Trojan horses, spyware, backdoors, and rootkits) has held the attention of computer security researchers for decades. Currently, the most significant line of defense against malware is Anti-Virus (AV) software products, which mainly use signature-based methods to recognize threats.
Given a collection of malware samples, AV vendors first categorize the samples into families so that samples in the same family share some common traits, and then generate the common string(s) used to detect variants of each malware family.

For many years, malware categorization has been done primarily by human analysts, which typically requires memorization, looking up description libraries, and searching sample collections. This manual process is time-consuming and labor-intensive. Today’s malware samples are created at a rate of millions per day with the development of malware writing techniques. For example, the number of new malware samples collected by the Anti-virus Lab of Kingsoft is usually larger than 10,000 per day. There is thus an urgent need for effective methods for automatic malware categorization.

Over the last few years, many research efforts have been devoted to developing automatic malware categorization systems [4, 12, 10, 15, 18, 24]. In these systems, the detection process is generally divided into two steps: feature extraction and categorization. In the first step, various features such as Application Programming Interface (API) calls and instruction sequences are extracted to capture the characteristics of the file samples. These features can be extracted via static analysis and/or dynamic analysis. In the second step, intelligent techniques are used to automatically categorize the file samples into different classes based on computational analysis of the feature representations. These intelligent malware detection systems vary in their use of feature representations and categorization methods. They have had isolated successes in clustering and/or classifying particular sets of malware samples, but they have limitations on the effectiveness and efficiency and few have