Data Mining for Network Intrusion Detection

Paul Dokas, Levent Ertoz, Vipin Kumar, Aleksandar Lazarevic, Jaideep Srivastava, Pang-Ning Tan
Computer Science Department, 200 Union Street SE, 4-192, EE/CSC Building
University of Minnesota, Minneapolis, MN 55455, USA
{dokas, ertoz, kumar, aleks, srivasta, ptan}@cs.umn.edu

Abstract

This paper gives an overview of our research in building rare class prediction models for identifying known intrusions and their variations, and anomaly/outlier detection schemes for detecting novel attacks whose nature is unknown. Experimental results on the KDDCup’99 data set have demonstrated that our rare class predictive models are much more efficient in the detection of intrusive behavior than standard classification techniques. Experimental results on the DARPA 1998 data set, as well as on live network traffic at the University of Minnesota, indicate that the new techniques show great promise in detecting novel intrusions. In particular, during the past few months our techniques have been successful in automatically identifying several novel intrusions that could not be detected using state-of-the-art tools such as SNORT. In fact, many of these have been on the CERT/CC list of recent advisories and incident notes.

1. Introduction

As the cost of information processing and Internet accessibility falls, more and more organizations are becoming vulnerable to a wide variety of cyber threats. According to a recent survey [1] by CERT/CC (Computer Emergency Response Team/Coordination Center), the rate of cyber attacks has been more than doubling every year in recent times (Figure 1). It has become increasingly important to make our information systems, especially those used for critical functions in the military and commercial sectors, resistant to and tolerant of such attacks.
Intrusion detection includes identifying a set of malicious actions that compromise the integrity, confidentiality, and availability of information resources. Traditional methods for intrusion detection are based on extensive knowledge of signatures of known attacks. Monitored events are matched against the signatures to detect intrusions. These methods extract features from various audit streams, and detect intrusions by comparing the feature values to a set of attack signatures provided by human experts. The signature database has to be manually revised for each new type of intrusion that is discovered. A significant limitation of signature-based methods is that they cannot detect emerging cyber threats, since by their very nature these threats are launched using previously unknown attacks. In addition, even if a new attack is discovered and its signature developed, often there is a substantial latency in its deployment across networks. These limitations have led to an increasing interest in intrusion detection techniques based upon data mining [2, 3, 4, 5, 6].

Figure 1. Cyber Incidents Reported to CERT/CC

Data mining based intrusion detection techniques generally fall into one of two categories: misuse detection and anomaly detection. In misuse detection, each instance in a data set is labeled as ‘normal’ or ‘intrusion’ and a learning algorithm is trained over the labeled data. These techniques are able to automatically retrain intrusion detection models on different input data that include new types of attacks, as long as they have been labeled appropriately. Unlike signature-based intrusion detection systems, models of misuse are created automatically, and can be more sophisticated and precise than manually created signatures. A key advantage of misuse detection techniques is their high degree of accuracy in detecting known attacks and their variations.
Their obvious drawback is the inability to detect attacks whose instances have not yet been observed. Anomaly detection, on the other hand, builds models of normal behavior, and automatically detects any deviation from it, flagging the latter as suspect. Anomaly detection techniques thus identify new types of intrusions as deviations from normal usage [7, 8]. While anomaly detection is an extremely powerful and novel tool, a potential drawback of these techniques is their rate of false alarms. False alarms arise primarily because previously unseen (yet legitimate) system behaviors may also be recognized as anomalies, and hence flagged as potential intrusions.
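
The anomaly detection idea described above, together with its false alarm problem, can be illustrated with a minimal sketch. The sketch is not the scheme used in our work: it assumes a simple per-feature z-score model of normal behavior, and the feature choice (connection duration and kilobytes transferred), the sample values, and the threshold of 3.0 are all hypothetical.

```python
from statistics import mean, stdev

def fit_normal_profile(records):
    """Model 'normal behavior' as the per-feature mean and standard
    deviation of traffic assumed to be attack-free."""
    columns = list(zip(*records))
    # Assumes every feature varies in the training data (nonzero stdev).
    return [(mean(col), stdev(col)) for col in columns]

def anomaly_score(profile, record):
    """Largest absolute z-score of the record across all features."""
    return max(abs(x - mu) / sigma for x, (mu, sigma) in zip(record, profile))

def is_anomalous(profile, record, threshold=3.0):
    """Flag a record as suspect when it deviates strongly from the profile.
    The threshold is an illustrative assumption, not a tuned value."""
    return anomaly_score(profile, record) > threshold

# Hypothetical features: (connection duration in seconds, kilobytes transferred)
normal_traffic = [(1.0, 10.0), (1.2, 12.0), (0.9, 9.0), (1.1, 11.0), (0.8, 8.0)]
profile = fit_normal_profile(normal_traffic)

typical = (1.0, 10.5)        # resembles the training traffic
novel_attack = (0.1, 500.0)  # previously unseen attack pattern
legit_backup = (30.0, 40.0)  # legitimate, but unlike anything seen before

for name, record in [("typical", typical),
                     ("novel_attack", novel_attack),
                     ("legit_backup", legit_backup)]:
    print(name, is_anomalous(profile, record))
```

No attack label is needed to flag the novel attack, which is the strength of the approach. The legitimate backup transfer is flagged as well, because the model only knows that it deviates from the behavior seen so far; this is the kind of false alarm discussed above.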