Research Article
Machine-Learning Approach to Optimize SMOTE Ratio in Class
Imbalance Dataset for Intrusion Detection
Jae-Hyun Seo¹ and Yong-Hyuk Kim²
¹ Department of Computer Science and Engineering, Wonkwang University, 460 Iksandae-ro, Iksan-si, Jeonbuk 54649,
Republic of Korea
² School of Software, Kwangwoon University, 20 Kwangwoon-ro, Nowon-gu, Seoul 01897, Republic of Korea
Correspondence should be addressed to Yong-Hyuk Kim; yhdfly@kw.ac.kr
Received 30 April 2018; Revised 6 August 2018; Accepted 2 October 2018; Published 1 November 2018
Academic Editor: Giosuè Lo Bosco
Copyright © 2018 Jae-Hyun Seo and Yong-Hyuk Kim. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
The KDD CUP 1999 intrusion detection dataset was introduced at the third international knowledge discovery and data mining
tools competition and has been widely used in many studies. The attack types of the KDD CUP 1999 dataset are divided into four
categories: user to root (U2R), remote to local (R2L), denial of service (DoS), and Probe. We use five classes by adding the normal
class. We define the U2R, R2L, and Probe classes, each of which accounts for less than 1% of the total dataset, as rare classes. In
this study, we attempt to mitigate the class imbalance of the dataset. Using the synthetic minority oversampling technique (SMOTE),
we attempted to optimize the SMOTE ratios for the rare classes (U2R, R2L, and Probe). We randomly generated a number of tuples
of SMOTE ratios and used them to create a numerical model for optimizing the SMOTE ratios of the rare classes. Support
vector regression was used to create the model. We assigned each instance in the test dataset to the model and chose the
best SMOTE ratios. The experiments using machine-learning techniques were conducted using the best ratios. The results using
the proposed method were significantly better than those of the previous approach and other related work.
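The 1% rule used above to define rare classes can be illustrated with a minimal sketch. The function name `rare_classes` and the label counts are hypothetical placeholders chosen only to show the rule; they are not figures from the KDD CUP 1999 dataset.

```python
from collections import Counter

RARE_THRESHOLD = 0.01  # "less than 1% of the total dataset"


def rare_classes(labels, threshold=RARE_THRESHOLD):
    """Return the set of class names whose share of `labels` is below `threshold`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name for name, n in counts.items() if n / total < threshold}


# Hypothetical label counts (10,000 records in total), for illustration only.
example_labels = (
    ["normal"] * 2000 + ["DoS"] * 7900 + ["Probe"] * 80 + ["R2L"] * 18 + ["U2R"] * 2
)
print(sorted(rare_classes(example_labels)))
```

With these placeholder counts, Probe, R2L, and U2R each fall below the threshold, while normal and DoS do not.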
1. Introduction
Early intrusion detection systems (IDSs) [1] are divided into
host-based IDS (HIDS) and network-based IDS
(NIDS). HIDS has the advantage of analyzing the system log
and resource usage information by host and user.
However, installing an IDS on each host increases the
number of management points and consumes more resources. If
network-level packet analysis is not possible and the attacker
takes control of the system, the IDS may be interrupted. NIDS
has the advantages that an IDS does not need to be installed on
each host and that analysis can be performed at the level of
the entire network. However, it has the disadvantage that only
attacks passing through the IDS can be confirmed, and it is
difficult to confirm attack attempts at the system level. In
early 2003, the IDS was losing the trust of users because of the
problem of false positives. False positives are caused
by erroneous rules, traffic irregularities, and the
limitations of pattern matching tests. Even though this IDS
problem has not been solved to date, “pattern matching” is
still used as a basis for security solutions.
Intrusion detection [2] is divided into misuse
detection and anomaly detection. In misuse detection,
detected attacks are compared with existing signatures in the
database to determine whether they are intrusions. While
misuse detection detects only known attacks, anomaly
detection can detect a new type of attack whose pattern
differs from normal traffic and from the known attack types.
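The difference between the two approaches can be sketched in a few lines. This is only a toy illustration under stated assumptions: the signature set, the function names, and the z-score rule for anomaly detection are hypothetical stand-ins, not the detectors used in this study or in [2].

```python
from statistics import mean, stdev

# Hypothetical signature database of known-bad payload patterns.
KNOWN_SIGNATURES = {"/etc/passwd", "' OR '1'='1"}


def misuse_detect(payload, signatures=KNOWN_SIGNATURES):
    """Flag a payload only if it contains a known attack signature."""
    return any(sig in payload for sig in signatures)


def anomaly_detect(value, normal_values, z_threshold=3.0):
    """Flag a measurement that deviates strongly from the normal-traffic profile.

    Assumes a simple z-score rule; real anomaly detectors are more elaborate.
    """
    mu = mean(normal_values)
    sigma = stdev(normal_values)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Hypothetical packet sizes (bytes) observed during normal operation.
normal_sizes = [480, 500, 510, 495, 505, 490, 520]

print(misuse_detect("GET /../../etc/passwd HTTP/1.0"))  # matches a signature
print(misuse_detect("GET /index.html HTTP/1.0"))        # no signature matches
print(anomaly_detect(5000, normal_sizes))               # far from the profile
print(anomaly_detect(500, normal_sizes))                # within the profile
```

The signature check cannot flag an attack that is absent from its database, whereas the profile-based check can flag unseen behavior at the cost of possible false positives.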
Many researchers have studied intrusion detection. In
general, researchers have attempted to distinguish the normal class
from attack classes using publicly available intrusion
detection evaluation datasets and to identify the exact attack
type. However, classifying rare classes in a huge
real-time dataset requires a long computation time, which makes it
difficult to achieve good efficiency. Improving classification
performance by adjusting the class ratio requires creating and
testing many experimental datasets.
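As a rough illustration of how oversampling changes the class ratio, the core interpolation step of SMOTE can be sketched as follows. The function `smote`, its parameters, and the interpretation of `ratio` as "synthetic samples per original sample" are assumptions made for this sketch; the feature vectors are hypothetical, and this is not the implementation used in the experiments.

```python
import math
import random


def smote(minority, n_synthetic, k=5, rng=None):
    """Create `n_synthetic` points by interpolating between minority samples
    and one of their k nearest minority-class neighbours."""
    rng = rng or random.Random(0)
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest neighbours of x among the other minority samples
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Hypothetical two-feature samples of one rare class.
rare_samples = [(0.0, 1.0), (0.2, 0.9), (0.1, 1.2)]
ratio = 2.0  # assumed meaning: two synthetic samples per original sample
new_points = smote(rare_samples, n_synthetic=int(ratio * len(rare_samples)))
print(len(rare_samples) + len(new_points))  # size of the class after oversampling
```

Each candidate ratio yields a different training set, which is why searching over ratios by retraining a classifier for every candidate becomes expensive.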
Hindawi
Computational Intelligence and Neuroscience
Volume 2018, Article ID 9704672, 11 pages
https://doi.org/10.1155/2018/9704672