I. J. Computer Network and Information Security, 2019, 4, 43-52
Published Online April 2019 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijcnis.2019.04.06
Copyright © 2019 MECS

Intrusion Detection using Machine Learning and Feature Selection

Prachi, Heena Malhotra
The NorthCap University, Gurgaon, India
E-mail: {prachiah1985, malhotraheena17}@gmail.com

Prabha Sharma
The NorthCap University, Gurgaon, India
E-mail: prabhasharma@ncuinda.edu

Received: 13 February 2019; Accepted: 27 February 2019; Published: 08 April 2019

Abstract—Intrusion detection is one of the most common approaches for detecting malicious activity in a network by analyzing its traffic. Machine Learning (ML) algorithms help to analyze high-dimensional network traffic and to identify abnormal flows with high accuracy. It is crucial to integrate ML algorithms with dimensionality reduction in order to decrease the complexity of processing huge datasets and to detect intrusions in real time. This paper evaluates the 10 most popular ML algorithms on the NSL-KDD dataset. The algorithms are then ranked to identify the best-performing one on the basis of several parameters such as specificity, sensitivity, and accuracy. Analysis of the top 4 algorithms shows that they consume a lot of time in model building. Therefore, feature selection is applied to detect intrusions in as little time as possible without compromising accuracy. Experimental results demonstrate which algorithm works best with and without feature selection/reduction in terms of achieving high accuracy while minimizing the time taken to build the model.

Index Terms—Network, Intrusion, Machine Learning, NSL-KDD Dataset, Feature Selection.

I. INTRODUCTION

Technological advances in the communication industry have massively increased the volume of data and its transmission across the globe via the internet. However, such advances put valuable information and data at risk [1]. Today, an intrusion can happen within a few seconds, which gives rise to the need for stronger security systems. An Intrusion Detection System (IDS) [2] analyzes network traffic to identify malicious actions.

Currently available IDSs fall into two major categories [3]: misuse-based and anomaly-based detection. Misuse detection identifies an intrusion on the basis of already known patterns, commonly called signatures; it is therefore also referred to as signature-based detection (e.g., Snort [4]). Anomaly detection [5] identifies any unacceptable deviation from normal traffic. Unlike signature-based IDSs, anomaly detection can identify zero-day attacks, but it generates a large number of false alarms. It also faces many challenges when dealing with huge amounts of high-dimensional data.

In order to analyze huge volumes of data, most existing IDSs use Machine Learning (ML) algorithms to identify intrusions efficiently. Although many techniques are available for detection, only a few are effective in producing high accuracy and low false positives on huge amounts of data [6]. Also, some ML algorithms perform better than others in terms of accuracy but take more training time to build models on large datasets. Hence, there is a pressing need to combine ML algorithms with feature selection/reduction in order to classify reduced-dimensional data accurately while taking less time to build the model. An ideal IDS should be able to spot zero-day attacks quickly, with high accuracy and low false positives, so that intrusions can be prevented as early as possible [7].
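
As an illustration of the evaluation criteria named in this paper (accuracy, sensitivity, specificity, and false positives), the following minimal Python sketch computes them from binary confusion-matrix counts using their standard definitions, with "attack" taken as the positive class. The names `ConfusionCounts` and `ids_metrics` and the example counts are assumptions made for illustration only; they are not taken from the paper or from WEKA, and no experimental result of the paper is reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts, with 'attack' as the positive class."""
    tp: int  # attack records classified as attack
    fn: int  # attack records classified as normal (missed intrusions)
    fp: int  # normal records classified as attack (false alarms)
    tn: int  # normal records classified as normal


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def ids_metrics(c: ConfusionCounts) -> dict:
    """Compute standard detection metrics from confusion-matrix counts."""
    total = c.tp + c.fn + c.fp + c.tn
    return {
        # Fraction of all records classified correctly.
        "accuracy": _ratio(c.tp + c.tn, total),
        # True positive rate: fraction of attacks that were detected.
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        # True negative rate: fraction of normal traffic left unflagged.
        "specificity": _ratio(c.tn, c.tn + c.fp),
        # Fraction of normal traffic that raised a false alarm.
        "false_positive_rate": _ratio(c.fp, c.fp + c.tn),
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    counts = ConfusionCounts(tp=90, fn=10, fp=5, tn=95)
    for name, value in ids_metrics(counts).items():
        print(f"{name}: {value:.3f}")
```

Under these definitions the false positive rate equals 1 − specificity, so a detector with low specificity necessarily produces many false alarms.
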
The objective of this paper is therefore to design an intrusion detection model that integrates ML algorithms with feature selection and feature reduction methods to detect intrusions with high accuracy and low false positives within a short span of time. This paper evaluates the performance of the 10 most popular ML algorithms in WEKA [8] using the NSL-KDD dataset [9]. The algorithms are then ranked based on their performance on parameters such as specificity, sensitivity, accuracy, and the time taken to build the model. To achieve high accuracy, fewer false alarms, and minimum training time on large datasets, this paper applies dimensionality reduction methods to the best 4 ML algorithms. The performance of these 4 algorithms is then evaluated with and without feature selection/reduction in order to build an ideal model for intrusion detection.

The organization of this paper is as follows. Work related to intrusion detection is highlighted in Section II.