The Parallel Sieve Method for a Virus Scanning Engine

Hiroki Nakahara, Tsutomu Sasao, Munehiro Matsuura, and Yoshifumi Kawamura††
Kyushu Institute of Technology, Japan
††Renesas Technology Corp., Japan

Abstract

This paper shows a new architecture for a virus scanning system, which is different from that of an intrusion detection system. The proposed method uses two-stage matching: in the first stage, a hardware filter quickly scans the text to find partial matches, and in the second stage, the MPU scans the text to find a total match in the set of 514,287 ClamAV virus patterns. To make the hardware filter simple, we use a finite-input memory machine (FIMM). To reduce the memory size of the FIMM, we introduce the parallel sieve method. The proposed method is memory-based, so it is quickly reconfigurable and dissipates less power than a TCAM-based method. The system is implemented on the Stratix III FPGA with three off-chip SRAMs and an SDRAM, where all 514,287 ClamAV virus patterns are stored. Compared with existing methods, our method achieves an area-throughput ratio that is 1.41 to 31.36 times more efficient.

1 Introduction

Malware (a composite word from malicious software) is intended to damage computer systems. With the wide use of the Internet, users can easily access and download dangerous data, so the risk of malware infection is increasing. Malware secretly installs a bot virus, a back door, or a keylogger. As a result, password exploitation, information theft, and illegal remote operation can harm computer users. Although a software-based virus scanning system can clean and isolate the malware, the throughput of software-based scanning is at most tens of megabits per second (Mbps) [16]. Thus, the software-based approach cannot keep up with modern Internet throughput, which is more than one gigabit per second (Gbps).
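The two-stage matching described in the abstract can be illustrated with a minimal software sketch. It is not the FIMM or the parallel sieve method: as a simplifying assumption, the filter here is a plain dictionary keyed on a fixed-length pattern prefix, and the names PREFIX_LEN, build_filter, and scan are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict

PREFIX_LEN = 4  # assumed filter window in bytes; not a value from the paper


def build_filter(patterns, k=PREFIX_LEN):
    """Map each k-byte pattern prefix to the full patterns that start with it."""
    table = defaultdict(list)
    for p in patterns:
        if len(p) < k:
            raise ValueError("pattern shorter than filter window")
        table[p[:k]].append(p)
    return table


def scan(text, table, k=PREFIX_LEN):
    """Return sorted (offset, pattern) pairs for every full match in text."""
    hits = []
    for i in range(len(text) - k + 1):
        # Stage 1: cheap partial match on a k-byte window (the filter's role).
        candidates = table.get(text[i:i + k])
        if not candidates:
            continue
        # Stage 2: exact comparison with the full patterns (the MPU's role).
        for p in candidates:
            if text.startswith(p, i):
                hits.append((i, p))
    return sorted(hits)
```

In this sketch, most text positions are rejected by the single table lookup in stage 1, so the more expensive full comparison in stage 2 runs only at the few offsets where a partial match was found.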
Malware is becoming more prevalent and more complex, so virus scanning on computer systems will become a bottleneck in the future. Recently, hardware-based virus scanning systems have been attached to the gateway between the Internet and the Intranet [22]. Fig. 1 shows an example of a virus scanning system. To detect malware, the packet receiver first assembles the data from the incoming packets and decompresses it if it is compressed. Then, the virus scanning engine inspects the data to see whether it contains malware. Finally, the packet sender assembles the data into packets and sends them to the Intranet. The most important part of the virus scanning system is the virus scanning engine. Other parts can be realized by conventional techniques. So, in this paper, we focus on the virus scanning engine.

Figure 1. Virus Scanning System. (Block labels: Firewall, Internet, Router, PHY/MAC Ports, Packet Receiver, Virus Scanning Engine, Packet Sender, PHY/MAC Ports, Intranet.)

In a typical commercial virus scanning system [22], the throughput is about 1.2 Gbps, the power dissipation is 450 W, and the price is around $10,000. Here, we consider a virus scanning engine with the following features:

High throughput: It has a throughput of more than one Gbps.

Low power: It is SRAM-based rather than TCAM-based [3, 24]. The power dissipation of the TCAM is higher than that of the SRAM. Also, the number of transistors per bit for the TCAM is larger than that for the SRAM [14]. Table 1 [9] compares the SRAM with the TCAM.

High-speed reconfigurable: It uses a memory-based realization rather than a hard-wired realization. Some virus scanning software, e.g., Kaspersky [10], updates the virus data every hour. Although the random logic implementation of the virus scanning circuit on the FPGA [6] is fast and compact, the time for place-and-route is longer than the period between virus pattern updates.
Thus, the hard-wired system is unsuitable for quick updates.

Memory efficient: Various memory-based methods have been proposed. For example, in [5], the patterns are embedded into the memory of a sequential circuit for an Aho-Corasick automaton. It requires 46 bytes per character. For the current version (v.0.94.2) of ClamAV [7] (the most popular open-source anti-virus software), the number of patterns is 514,287. To store all the patterns, tens of high-end