Detecting Malicious Javascript in PDF through Document Instrumentation

Daiping Liu
Department of Computer Science, College of William and Mary
dliu01@email.wm.edu

Haining Wang
Department of Computer Science, College of William and Mary
hnw@cs.wm.edu

Angelos Stavrou
Center for Secure Information Systems, George Mason University
astavrou@gmu.edu

Abstract— An emerging threat vector, malware embedded inside popular document formats, has become rampant since 2008. Owing to its widespread use and Javascript support, PDF has been the primary vehicle for delivering embedded exploits. Unfortunately, existing defenses are limited in effectiveness, vulnerable to evasion, or too computationally expensive to be employed as an online protection system. In this paper, we propose a context-aware approach for the detection and confinement of malicious Javascript in PDF. Our approach statically extracts a set of static features and inserts context monitoring code into a document. When an instrumented document is opened, the context monitoring code inside cooperates with our runtime monitor to detect potential infection attempts in the context of Javascript execution. Thus, our detector can identify malicious documents by using both static and runtime features. To validate the effectiveness of our approach in a real-world setting, we first conduct a security analysis, showing that our system remains effective in detection and robust against evasion attempts even in the presence of sophisticated adversaries. We then implement a prototype of the proposed system and perform extensive experiments using 18623 benign PDF samples and 7370 malicious samples. Our evaluation results demonstrate that our approach can accurately detect and confine malicious Javascript in PDF with minor performance overhead.

Keywords-Malcode-bearing PDF; malicious Javascript; malware detection and confinement; document instrumentation.

I. INTRODUCTION

Malware authors are constantly seeking new ways to compromise computer systems. Recently, they have begun to take advantage of popular forms of data exchange, focusing their attention on malcode-bearing PDF documents [1]. The PDF standard has several unique advantages when used as an attack vector: (1) it has replaced Microsoft Word as the most dominant document format; (2) it has been widely considered to be safe; (3) it is easy to craft a malicious PDF; and, more importantly, (4) it supports Javascript. All of these features have made PDF one of the most attractive exploitation vehicles. This is clearly supported by the fact that the number of discovered PDF vulnerabilities has quadrupled in the last five years [2], with many attack cases having been reported [1] [3]. The most striking observation comes from the Microsoft Malware Protection Center, showing that the exploitation of old PDF vulnerabilities is on the rise [1].

Despite the increasing number of successful PDF infections and their impact on end users, thus far only a few methods for detecting malicious PDF have been proposed in response to this emerging threat. Unfortunately, it appears that traditional signature-based and behavior-based detection methods, which are favored by the majority of modern anti-virus software, cannot handle malicious PDF well. Recently, researchers have exploited the structural differences between benign and malicious documents to detect malicious PDF [4] [5] [6] [7]. These methods have been proven to be simple, fast, and accurate. However, when attackers are aware of these static features, they can easily evade detection [8]. Another recent work extracts Javascript and tests it for malicious behavior in an emulated interpreter [9]. Although this approach is more robust against evasion, attackers can still exploit syntax obfuscations to subvert Javascript extraction. It is also very costly to emulate all PDF-specific Javascript objects.
In 2009, Adobe announced the Protected Mode, a sandboxing mechanism that runs the PDF reader in a confined environment. Although it raises the bar, Adobe Sandbox has its own drawbacks. An obvious one is that there exist vulnerabilities in the sandbox itself. In fact, attackers have already discovered different ways to escape Adobe Sandbox [10] [11].

The detection of malicious PDF poses two distinct challenges. First, users tend to open multiple PDFs simultaneously. However, the runtime behaviors of a PDF reader can vary as different documents are opened, and both benign and malicious PDFs are processed by one single thread in the PDF reader. These factors inevitably affect detection accuracy due to the interference among multiple open documents. Second, although it is straightforward to locate traditional malware once detected, it is non-trivial to pinpoint the malicious PDF documents, since any of the open documents could be malicious.

In this paper, we introduce a context-aware approach to detect and confine malicious Javascript in PDF through static document instrumentation and runtime behavior monitoring. Our method is motivated by the fact that some essential operations of Javascript in malicious PDF rarely occur in benign documents. Our context-aware approach can effectively overcome the two aforementioned challenges. On the one hand, a context-aware approach makes detection features, such as suspicious memory consumption, more effective. On the other hand, the context information explicitly indicates which open documents are malicious.

There are different ways to achieve context-aware monitoring. One intuitive choice is to extract Javascript from documents [9] [14]. Alternatively, Javascript interpreters can be instrumented [15]. However, these methods are neither robust nor easy to implement in practice. Instead, we choose to perform static document instrumentation.
This method, to the best of our knowledge, has never been explored before for PDF malware detection and confinement. For each PDF Javascript snippet, we include a prologue and an epilogue to inform our runtime detector of the entry to and exit from