Comparison of two feature selection methods in Intrusion Detection Systems
M. J. Fadaeieslam (1, 2), B. Minaei-Bidgoli (2), M. Fathy (2), M. Soryani (2)
(1) Islamic Azad University – Semnan Branch
(2) Department of Computer Engineering, Iran University of Science and Technology
fadaei@iust.ac.ir, b_minaei@iust.ac.ir, mahfathy@iust.ac.ir, soryani@iust.ac.ir
Abstract
The quality of features directly affects the
performance of classification. Many feature selection
methods have been introduced to remove redundant and
irrelevant features, because raw features may reduce
the accuracy or robustness of classification. In this paper
we propose a new method for feature selection based
on Decision Dependent Correlation (DDC). We use
an SVM classifier, and the results on the DARPA KDD99
benchmark dataset indicate that the proposed method
outperforms Principal Component Analysis (PCA).
1. Introduction
In complex classification domains like intrusion
detection, some features may be irrelevant or
redundant, which complicates the classification process.
The main goal of feature selection is to identify the
features that contribute little to classification so that
they can be eliminated. This has the
benefit of decreasing storage requirements, reducing
processing time and improving the detection rate [1].
An IDS must examine very large volumes of audit data.
Therefore it should reduce the amount of data to save
processing time. Feature reduction may be done in
several ways [1, 2, 3]. We propose a new method
based on the DDC parameter [4] and compare it with
PCA on the KDD99 dataset. It is very important to note
that the KDD99 dataset has a large number of
duplicated samples and the number of attack types is
not the same. These properties affect the performance of
classifier systems and must be addressed carefully. After a
computationally intensive preprocessing phase and
feature extraction, an SVM classifier was used for
classification.
The rest of this paper is organized as follows. The
preprocessing phase is explained in Section 2. PCA
and the proposed method are described in Sections 3
and 4. Experimental results and conclusions are
reported in Sections 5 and 6.
2. Preprocessing
The 1998 DARPA Intrusion Detection Evaluation
Program provided a standard set of audited data, which
includes a wide variety of intrusions simulated in a
U.S. Air Force LAN environment. The 1999 KDD
intrusion detection contest used a version of this
dataset [5, 6, 7]. The raw training data was about 4
GB of compressed binary TCP dump data from seven
weeks of network traffic. This yielded about five
million connection records, and the 10% training
subset consisted of 494021 records. Similarly, the two
weeks of test data yielded around two million
connection records. For each connection, 41 features
were defined, categorized as basic TCP features,
content features, time-based traffic features, and
host-based traffic features.
Each connection is labeled either as normal or as an
attack, with exactly one specific attack type. Attacks
fall into four main categories: Denial of Service (DoS),
Remote-to-Local (R2L), User-to-Root (U2R), and Probing
[6].
The goal in this task was to classify each of the
311029 connection records in the test dataset as normal or
attack; the specific type of attack was not important.
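As an illustration of this two-class setup, the following minimal sketch collapses per-connection labels into normal versus attack. The helper name `to_binary_label` and the 0/1 encoding are assumptions made for this example and are not taken from the paper; the sample labels follow the trailing-period convention of the distributed KDD99 files.

```python
# Hypothetical helper: collapse a KDD99 connection label to normal/attack.
def to_binary_label(label: str) -> int:
    """Return 0 for normal traffic and 1 for any attack type."""
    # Labels in the distributed KDD99 files end with a period ("normal.").
    return 0 if label.strip().rstrip(".").lower() == "normal" else 1


# Illustrative labels only; not a sample drawn from the paper's data.
labels = ["normal.", "smurf.", "neptune.", "normal.", "guess_passwd."]
binary = [to_binary_label(x) for x in labels]
print(binary)
```

An SVM implementation may expect -1/+1 rather than 0/1; the mapping would be adjusted accordingly.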
It is important to note that:
1. The test data is not from the same probability
distribution as the training data.
2. It includes specific attack types not in the
training data.
3. The KDD Cup 1999 dataset has a very large
number of duplicate records [7].
In this paper, these duplicate samples were removed
from the dataset. After filtering out the duplicate
records, the total numbers of records in training and
testing datasets were 145586 and 77291 respectively.
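The duplicate-removal step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that two records are duplicates when every field, including the label, is identical, and the sample records are invented for the example.

```python
def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving input order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec)  # hashable key built from all fields of the record
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Illustrative records (a few fields plus the label), not real KDD99 rows.
records = [
    (0, "tcp", "http", "SF", 181, 5450, "normal."),
    (0, "icmp", "ecr_i", "SF", 1032, 0, "smurf."),
    (0, "icmp", "ecr_i", "SF", 1032, 0, "smurf."),
    (0, "tcp", "http", "SF", 181, 5450, "normal."),
]
unique = drop_duplicates(records)
print(len(records), "->", len(unique))
```

Using a set of seen keys keeps the pass linear in the number of records, which matters for datasets with hundreds of thousands of connections.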
Features in the KDD dataset are continuous,
discrete or symbolic. After removing duplicate data,
symbolic features were converted to numeric ones.
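The paper does not state how symbolic values were mapped to numbers. One simple possibility, shown here purely as an assumption, is to assign consecutive integer codes in order of first appearance; the helper name `build_code_table` is hypothetical.

```python
def build_code_table(values):
    """Assign consecutive integers to symbolic values in order of first appearance."""
    table = {}
    for v in values:
        if v not in table:
            table[v] = len(table)
    return table


# Illustrative values for a symbolic feature such as protocol_type.
protocols = ["tcp", "udp", "tcp", "icmp", "udp"]
table = build_code_table(protocols)
encoded = [table[p] for p in protocols]
print(table)
print(encoded)
```

Integer codes impose an arbitrary ordering on the symbols; a one-hot encoding avoids this at the cost of more columns.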
Continuous features must be discretized for computing
Seventh International Conference on Computer and Information Technology
0-7695-2983-6/07 $25.00 © 2007 IEEE
DOI 10.1109/CIT.2007.99