Racewalk: fast instruction frequency analysis and classification for shellcode detection in network flow

Dennis Gamayunov, Nguyen Thoi Minh Quan, Fedor Sakharov, Edward Toroshchin
Computer Systems Lab, Department of Computational Mathematics and Cybernetics, Moscow State University, Moscow, Russia
Email: {gamajun,ntmquan,sakharov,hades}@lvk.cs.msu.su

Abstract: Memory corruption attacks still play a significant role in present-day cybercrime, being one of the keystones of worm and virus propagation and of botnet building. Moreover, recent disclosures of vulnerabilities in widespread networking equipment show that the problem is unlikely to fade away in the near future. The subject of this paper is NOP-sled detection, one of the approaches to detecting malicious code in network flow. A NOP-sled is a common shellcode preamble used in memory corruption attacks to increase the probability of successful exploitation of the target. We propose a significant modification of the Stride algorithm that has linear computational complexity and runs over 10 times faster than the original Stride, and a novel approach to NOP-sled detection using IA-32 instruction frequency analysis and SVM-based classification, which gives significantly fewer false positives than existing algorithms. Evaluation with Metasploit Framework, CLET, ecl-poly and ADMmutate shows that the NOP-sleds produced by existing shellcode generators have instruction frequency peculiarities that make it possible to distinguish sleds from normal network data with high accuracy, while reducing the false positive rate and operating at close to 1 Gbps.

Keywords: shellcode; polymorphism; metamorphism; intrusion detection; intrusion prevention; SVM; support vector machine; instruction frequency analysis

I.
INTRODUCTION

Nowadays one of the most vexing and fast-growing problems in computer networks is the widespread growth of botnets, which cybercriminals use for various illegal and disruptive activities: distributed denial-of-service attacks, phishing, abuse-immune hosting and so on. Internet worms not only help build botnets and infect victims, but also place a heavy load on Internet infrastructure. In recent years the scientific community has put considerable effort into developing methods for detecting malicious machine code in network flow, which help detect worm propagation at the earliest stages of an outbreak. This research direction seems very promising because it allows preventive measures to be taken instead of fighting the consequences.

One of the ways a worm, virus or intruder takes control of target network nodes is by searching for and exploiting vulnerabilities, whether in the target's operating system or in a networking application (web server, browser, mail client, etc.). When such a vulnerability actually exists and is remotely exploitable, it allows arbitrary code execution in the address space of the vulnerable program, with the effective execution permissions of that program. However, the actual permissions of the vulnerable program often do not matter, since most modern operating systems and common applications have many locally exploitable vulnerabilities that the shellcode payload can use as a second target.

Historically, the very first step performed by malicious code after successful exploitation was to start a command shell, giving the attacker interactive command access to the victim host. That is why "shellcode" became a common designation for any malicious code used in "memory corruption" attacks. In the early days shellcodes were static and could be detected in the network channel with simple patterns (or signatures).
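The pattern-based detection of static shellcode mentioned above can be sketched as a plain byte-substring search over a packet payload. This is only an illustration: the signature names, the byte patterns and the sample payloads are hypothetical and are not taken from any real intrusion detection rule set or from the evaluation in this paper.

```python
# Hypothetical static signatures (illustrative only, not real IDS rules).
SIGNATURES = {
    # 16 consecutive single-byte IA-32 NOP instructions (opcode 0x90)
    "classic-nop-run": b"\x90" * 16,
    # literal shell path that static shellcode may carry in clear text
    "bin-sh-string": b"/bin/sh",
}


def match_signatures(payload):
    """Return the names of all signatures that occur in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]


# Made-up payloads for demonstration purposes.
static_payload = b"GET /" + b"\x90" * 32 + b"\x31\xc0\x50\x68/bin/sh"
benign_payload = b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n"

# Same idea as static_payload, but the sled uses 0x41 (INC ECX in 32-bit
# mode) instead of 0x90 and the shell path is split into two pushed words.
mutated_payload = b"GET /" + b"\x41" * 32 + b"\x31\xc0\x50\x68//sh\x68/bin"

for label, data in [("static", static_payload),
                    ("benign", benign_payload),
                    ("mutated", mutated_payload)]:
    print(label, match_signatures(data))
```

In this sketch the static payload matches both patterns, while the mutated payload matches neither, although only the byte representation changed. Fixed byte patterns are therefore brittle against even small code variations.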
Signatures are still used in popular intrusion detection and prevention systems such as Snort, Bro and others to detect some well-known malicious instruction sequences. Eventually attackers found ways to evade signature-based analysis using polymorphic and obfuscated shellcodes. A typical polymorphic shellcode generator, such as those in Metasploit Framework [1], ecl-poly [2], ADMmutate [3] or CLET [4], creates multiple variations of a given shellcode with equal functionality. Signature analysis fails to detect such shellcode, because there are millions of possible code variations and it is not feasible to process the corresponding number of signatures.

The task of shellcode detection in network flow has an important peculiarity: performance is a key feature of any solution, along with accuracy and false positive rate, because the event frequency (packet arrival) is very high and even a "low" false positive rate of about 10^-3% results in a huge number of false alerts. Besides, any solution intended for a network appliance is heavily restricted in the data sources it can use for analysis. For example, the memory state or the overall execution context of programs on network hosts is not available, and dynamic control flow analysis cannot be employed.

All types of remote exploits implement one of the variations of the "memory corruption" attack. Memory corruption occurs when some code within a program writes to memory more