Supervised Versus Unsupervised Discretization for Improving Network Intrusion Detection

Doaa Hassan
Computers and Systems Department
National Telecommunication Institute
Cairo, Egypt
Email: doaa@nti.sci.eg

Abstract—Discretization is an important preprocessing step in the data mining process that transforms the continuous values of data features into discrete ones. Machine learning (ML) algorithms such as Support Vector Machines (SVM) and Random Forests (RF) are widely known for their robustness to the dimensionality of the data and are therefore used for the classification of high-dimensional data. Other ML algorithms, such as Naive Bayes (NB), fundamentally rely on feature discretization and have also been developed for classifying high-dimensional data. This paper investigates the effect of preprocessing the dataset with either supervised or unsupervised discretization techniques on the performance of the three aforementioned ML algorithms. These algorithms are run on NSL-KDD, a high-dimensional dataset and an improved version of the KDD Cup 99 dataset that is widely used for network intrusion detection. Our results show that preprocessing NSL-KDD with either supervised or unsupervised discretization techniques does not improve the performance of RF. On the other hand, supervised entropy-based discretization (EBD) and unsupervised Proportional k-Interval Discretization (PKID) yield slightly better performance for SVM and a significant improvement in the performance of NB. The results also show that the performance of NB with discretization remains clearly below that of SVM and RF, either with or without discretization. The paper therefore proposes an approach that combines discretization with the wrapper feature selection method in order to improve the performance of NB with discretization and bring it very close to the performance of either SVM or RF.
Index Terms—network intrusion detection; data discretization; wrapper feature selection; classification

I. INTRODUCTION

Since network attacks have increased in number and severity over the past few years, network intrusion detection is increasingly becoming a critical component of securing computer networks. Recently, optimizing the performance of intrusion detection using data mining techniques has received growing attention from the research community [1], due to the huge volumes of network audit data that must be analyzed to detect the complex and dynamic properties of intrusion behaviors. One such technique is to preprocess the network audit data with discretization algorithms [2], which transform the continuous features of the dataset into discrete ones by creating a set of disjoint intervals. This approach is mainly used to increase the capability of a classifier to successfully distinguish between anomalous network traffic and normal traffic. Two common mechanisms for discretization exist: supervised methods, which use the class information, and unsupervised methods, which do not use it at all [7], [3].

In this paper, we investigate the application of the two aforementioned discretization mechanisms to a high-dimensional dataset that has been widely used for intrusion detection: the NSL-KDD dataset [5], which is an improved version of the KDD Cup 99 dataset [4]. The latter includes a wide variety of intrusions simulated in a U.S. military network environment. The former consists of a reasonable number of records selected from the complete KDD dataset in order to enhance prediction performance. We study the effect of both discretization methods on the performance of three supervised ML algorithms, namely Support Vector Machines (SVM) [26], Random Forests (RF) [27], and Naive Bayes (NB) [25], run on the NSL-KDD dataset. These classifiers are used for detecting anomalous network connections in this dataset.
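The distinction between the two discretization mechanisms can be illustrated with a minimal sketch. The equal-frequency binning below stands in for the unsupervised family (PKID also builds equal-frequency intervals, with the bin count tied to the sample size), while the information-gain split is the core step that entropy-based discretization applies recursively; the toy feature and labels are illustrative, not drawn from NSL-KDD.

```python
import numpy as np

# Unsupervised equal-frequency discretization: bin edges depend only on
# the feature values, never on the class labels.
def equal_frequency_bins(values, n_bins):
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    edges = np.quantile(values, quantiles)
    return np.digitize(values, edges)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Supervised: choose the single cut point that maximizes information
# gain with respect to the class labels.
def best_entropy_split(values, labels):
    order = np.argsort(values)
    v, y = values[order], labels[order]
    best_gain, best_cut = 0.0, None
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue
        left, right = y[:i], y[i:]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = entropy(y) - cond
        if gain > best_gain:
            best_gain, best_cut = gain, (v[i - 1] + v[i]) / 2
    return best_cut

# Toy feature where small values are "normal" (0) and large are "attack" (1).
x = np.array([0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(equal_frequency_bins(x, 2))  # bins computed without looking at y
print(best_entropy_split(x, y))    # cut point chosen using y
```

On this toy data both methods happen to agree, but on real traffic features the supervised cut tracks class boundaries while the unsupervised bins track only the value distribution.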
Moreover, these classifiers are widely known for their robustness to the curse of dimensionality and hence are used for the classification of high-dimensional data [6]. Our experimental results show that applying either supervised or unsupervised discretization to the NSL-KDD dataset as a preprocessing step before classification does not improve the performance of RF on average, while it achieves on average a slight improvement in the performance of SVM and a notable improvement in the performance of NB, particularly with either the supervised EBD or the unsupervised PKID method. Finally, the results show that although applying discretization to a dataset before classification leads to a significant improvement in the performance of NB on the discretized dataset, the performance of NB with discretization still falls clearly short of that of either SVM or RF, with or without discretization. This motivates us to propose a new approach in this paper that improves the performance of the NB classifier with discretization. The approach combines discretization with the wrapper feature selection method: the combination is applied to the dataset that has been preprocessed by discretization in order to obtain a reduced version of it, and NB is then applied to the resulting reduced, discretized dataset. Our experimental results show that the performance of NB using the proposed approach outperforms that of SVM without discretization and is very close to the performance of either SVM or RF with discretization.

International Journal of Computer Science and Information Security (IJCSIS), Vol. 14, No. 10, October 2016, p. 583, https://sites.google.com/site/ijcsis/, ISSN 1947-5500
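The proposed pipeline, discretize, then select features with a wrapper driven by NB itself, then classify with NB on the reduced discretized data, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses synthetic data in place of NSL-KDD, scikit-learn's quantile `KBinsDiscretizer` as a stand-in for PKID, and `SequentialFeatureSelector` as the wrapper method.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for NSL-KDD (hypothetical data, for illustration only).
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)

# Step 1: unsupervised equal-frequency discretization (a proxy for PKID,
# which likewise builds equal-frequency intervals).
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_disc = disc.fit_transform(X)

# Step 2: wrapper feature selection -- candidate feature subsets are
# scored by the cross-validated accuracy of the NB classifier itself.
nb = CategoricalNB(min_categories=5)
selector = SequentialFeatureSelector(nb, n_features_to_select=4, cv=5)
X_red = selector.fit_transform(X_disc, y)

# Step 3: NB on the reduced, discretized dataset.
score = cross_val_score(nb, X_red, y, cv=5).mean()
print(f"NB accuracy on reduced discretized data: {score:.3f}")
```

The defining property of the wrapper method, as opposed to a filter, is visible in step 2: the same classifier that will ultimately be deployed scores each candidate feature subset, so the selected features are tuned to NB's bias.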