Comparison of two feature selection methods in Intrusion Detection Systems

M. J. Fadaeieslam 1, 2, B. Minaei-Bidgoli 2, M. Fathy 2, M. Soryani 2
1 Islamic Azad University – Semnan Branch
2 Department of Computer Engineering, Iran University of Science and Technology
fadaei@iust.ac.ir, b_minaei@iust.ac.ir, mahfathy@iust.ac.ir, soryani@iust.ac.ir

Abstract

The quality of features directly affects classification performance. Many feature selection methods have been introduced to remove redundant and irrelevant features, because raw features may reduce the accuracy or robustness of classification. In this paper we propose a new feature selection method based on Decision Dependent Correlation (DDC). We use an SVM classifier, and the results on the DARPA KDD99 benchmark dataset indicate that the proposed method outperforms Principal Component Analysis (PCA).

1. Introduction

In complex classification domains such as intrusion detection, some features may be irrelevant or redundant, which complicates the classification process. The main goal of feature selection is to identify the data that are less important to classification and can be eliminated. This has the benefit of decreasing storage requirements, reducing processing time and improving the detection rate [1]. An intrusion detection system (IDS) must examine very large volumes of audit data, so it should reduce the amount of data to save processing time. Feature reduction may be done in several ways [1, 2, 3]. We propose a new method based on the DDC parameter [4] and compare it with PCA on the KDD99 dataset. It is very important to note that the KDD99 dataset has a large number of duplicated samples and the number of attack types is not the same. These properties affect the performance of classifier systems and must be addressed carefully. After a computationally intensive preprocessing phase and feature extraction, an SVM classifier was used for classification. The rest of this paper is organized as follows.
The preprocessing phase is explained in Section 2. PCA and the proposed method are described in Sections 3 and 4. Experimental results and conclusions are reported in Sections 5 and 6.

2. Preprocessing

The 1998 DARPA Intrusion Detection Evaluation Program provided a standard set of audited data, which includes a wide variety of intrusions simulated in a U.S. Air Force LAN environment. The 1999 KDD intrusion detection contest used a version of this dataset [5, 6, 7]. The raw training data was about 4 GB of compressed binary TCP dump data from seven weeks of network traffic. It contained about five million connection records, and the 10% training dataset consisted of 494021 records. Similarly, the two weeks of test data yielded around two million connection records. For each connection, 41 features were defined, categorized as basic TCP features, content features, time-based traffic features, and host-based traffic features. Each connection is labeled either as normal or as an attack, with exactly one specific attack type. Attacks fall into four main categories: Denial of Service (DoS), Remote-to-Local (R2L), User-to-Root (U2R), and Probing [6].

The goal in this task was to classify the test dataset, containing 311029 connection records, into normal or attack (the type of attack was not important). It is important to note that:

1. The test data is not from the same probability distribution as the training data.
2. It includes specific attack types not in the training data.
3. The KDD Cup 1999 dataset has a very large number of duplicate records [7].

In this paper, these duplicate samples were removed from the dataset. After filtering out the duplicate records, the total numbers of records in the training and testing datasets were 145586 and 77291, respectively. Features in the KDD dataset are continuous, discrete or symbolic. After removing duplicate data, symbolic features were converted to numeric ones.
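The two preprocessing steps described above, duplicate removal and conversion of symbolic features, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The paper does not state how duplicates were identified or how symbolic values were mapped to numbers, so the sketch assumes that two records are duplicates when all fields (including the label) match, and that each symbolic value receives an integer code in order of first appearance. The function names `drop_duplicates` and `encode_symbolic` and the toy records are hypothetical; the column names in the comments follow the public KDD99 feature list.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Record = Tuple[Hashable, ...]


def drop_duplicates(records: Sequence[Sequence[Hashable]]) -> List[Record]:
    """Keep the first occurrence of each record, preserving input order.

    Assumption: a duplicate is a record whose fields all match an earlier one.
    """
    seen = set()
    unique: List[Record] = []
    for rec in records:
        key = tuple(rec)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def encode_symbolic(
    records: Sequence[Record], symbolic_columns: Sequence[int]
) -> Tuple[List[Record], Dict[int, Dict[Hashable, int]]]:
    """Replace symbolic values with integer codes (assumed encoding scheme).

    Codes are assigned per column in order of first appearance: 0, 1, 2, ...
    Returns the encoded records and the per-column codebooks.
    """
    codebooks: Dict[int, Dict[Hashable, int]] = {c: {} for c in symbolic_columns}
    encoded: List[Record] = []
    for rec in records:
        row = list(rec)
        for col in symbolic_columns:
            book = codebooks[col]
            # len(book) is evaluated before insertion, so a new value
            # receives the next unused integer code.
            row[col] = book.setdefault(row[col], len(book))
        encoded.append(tuple(row))
    return encoded, codebooks


# Toy records: (duration, protocol_type, service, flag, label)
toy_records = [
    (0, "tcp", "http", "SF", "normal"),
    (0, "tcp", "http", "SF", "normal"),    # exact duplicate
    (0, "icmp", "ecr_i", "SF", "smurf"),
    (2, "udp", "domain_u", "SF", "normal"),
    (0, "icmp", "ecr_i", "SF", "smurf"),   # exact duplicate
]

unique_records = drop_duplicates(toy_records)
encoded_records, codebooks = encode_symbolic(unique_records, symbolic_columns=[1, 2, 3])
```

Integer codes impose an arbitrary ordering on symbolic values. A one-hot encoding would avoid this at the cost of more dimensions, and which scheme suits a given classifier is a separate design choice.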
Seventh International Conference on Computer and Information Technology 0-7695-2983-6/07 $25.00 © 2007 IEEE DOI 10.1109/CIT.2007.99 83
Continuous features must be discretized for computing