SIMILARITY COEFFICIENT GENERATORS FOR NETWORK FORENSICS

A. Telidevara¹, V. Chandrasekaran¹, A. Srinivasan², R. Mukkamala³, S. Gampa³

¹ Sri Sathya Sai Institute of Higher Learning, Prashanti Nilayam, Anantapur, AP 515134, India
² Bloomsburg University, Bloomsburg, PA 17185, USA
³ Old Dominion University, Norfolk, VA 23529-0162, USA

ABSTRACT

IP spoofing is one of the most common network threats today. While current IP traceback techniques are capable of identifying the source of a message, they are limited by the large number of messages that routers must store to do so. One way to reduce the storage overhead is to store the messages as indices in a Bloom filter. Current systems use Bloom filters at a router to determine whether a given message has passed through that router. Often, however, one needs to know whether a similar message has traversed the router. This calls for similarity measures in the context of Bloom filters. In this paper, we develop such similarity measures (coefficients) in the context of two specialized Bloom filters: Hierarchical Bloom filter (HBF) and Winnowing Block Shingling (WBS). We compare the efficacy of these similarity measures with the Jaccard similarity coefficient. Simulations were carried out to evaluate the measures. The results indicate that the HBF measure is an optimistic metric and the WBS measure is a pessimistic one; the Jaccard measure falls between the two. We propose a weighted metric that combines all the metrics and is more flexible than the individual measures.

EDICS: SEC-NETW, FOR-CLAS, MOD-PERF

1. INTRODUCTION

Today, with the open model of the Internet, more and more networks are being attacked, either to put subnets or web servers out of commission (through denial-of-service attacks) or to damage systems through virus-laden messages. Mechanisms to stop or mitigate such attacks depend on our ability to identify and shut out the source host.
With IP spoofing, the source cannot easily be identified from the message headers. This is where IP traceback has been quite useful. Various methods have been proposed for IP traceback, including IP marking [8, 13], ICMP Traceback [1], overlay networks [1], and hash-based IP traceback [11]. Because of the high rate of traffic through routers, it is important to choose techniques that place low processing and storage demands on a router. For this reason, hash-based techniques have become popular. The Bloom filter is one such technique [2]. Variants of Bloom filters such as Hierarchical Bloom filters [9] and Winnowing Block Shingling have become quite popular [6, 10] since they reduce false positives.

Typically, an attacker may attack several sites by sending similar messages. This implies that, in order to trace back the source of a message efficiently, we need to look for the source of the given message and the sources of any messages that are similar to it. This approach helps to trace the source much faster and to identify multiple sources (if any) from which attacks are being generated. It is therefore important to know whether a message similar to the message under investigation has been inserted into a Bloom filter.

To this end, we develop similarity measures specifically for use in Bloom filters and their variants [6, 10]. These measures are compared with the Jaccard measure, a popular similarity measure in areas such as ranking fuzzy numbers [7], bio-geographic classifications [5], and cellular manufacturing systems [14]. The robustness of the suggested similarity measures in the context of IP traceback is evaluated through several simulation runs. The results clearly indicate that the HBF similarity measure is the most optimistic and the WBS similarity measure is the most conservative of the three. The Jaccard measure falls between the two.
This observation holds across different payload sizes, loads (the number of messages that a Bloom filter has seen), Bloom filter sizes, and percentages of change in the test message relative to the original message. Based on the results, we suggest a weighted metric that combines more than one measure, so that a user can choose the level of similarity appropriate to the application.

The paper is organized as follows. In Section 2, we briefly summarize the related concepts. Section 3 defines the similarity measures. In Section 4, we describe the simulation experiments and the results obtained. Finally, Section 5 offers some concluding remarks based on the results.

2. RELATED WORK
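
As background for the two building blocks named in the introduction, the following is a minimal sketch of a plain Bloom filter membership test and of the Jaccard coefficient computed over message blocks. It is illustrative only and is not the HBF or WBS measure developed in this paper. The class and function names, the salted SHA-256 hashing, the filter size m = 1024, the hash count k = 3, and the 4-byte block size are all assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter with m bits and k hash functions (illustrative values)."""

    def __init__(self, m: int = 1024, k: int = 3) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _indices(self, item: bytes):
        # Assumption: derive k indices by salting SHA-256 with the hash number.
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: bytes) -> None:
        for idx in self._indices(item):
            self.bits[idx] = 1

    def __contains__(self, item: bytes) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[idx] for idx in self._indices(item))


def blocks(payload: bytes, size: int = 4) -> set:
    """Split a payload into fixed-size blocks (hypothetical block size)."""
    return {payload[i:i + size] for i in range(0, len(payload), size)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (1.0 if both empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Example: an exact-match query versus a block-level similarity score.
bf = BloomFilter()
original = b"GET /index.html HTTP/1.1"
modified = b"GET /index.htm1 HTTP/1.1"  # one byte changed
bf.add(original)
print(original in bf)                                # the stored message is found
print(jaccard(blocks(original), blocks(modified)))   # 5 shared blocks out of 7
```

A whole-message Bloom filter answers only the exact-match question, so a message that differs in a single byte will almost certainly be reported as absent, whereas a block-level coefficient such as the one above still registers the overlap. This gap is what motivates similarity measures defined directly on Bloom filter variants.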