Research Article

Machine-Learning Approach to Optimize SMOTE Ratio in Class Imbalance Dataset for Intrusion Detection

Jae-Hyun Seo 1 and Yong-Hyuk Kim 2

1 Department of Computer Science and Engineering, Wonkwang University, 460 Iksandae-ro, Iksan-si, Jeonbuk 54649, Republic of Korea
2 School of Software, Kwangwoon University, 20 Kwangwoon-ro, Nowon-gu, Seoul 01897, Republic of Korea

Correspondence should be addressed to Yong-Hyuk Kim; yhdfly@kw.ac.kr

Received 30 April 2018; Revised 6 August 2018; Accepted 2 October 2018; Published 1 November 2018

Academic Editor: Giosuè Lo Bosco

Copyright © 2018 Jae-Hyun Seo and Yong-Hyuk Kim. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The KDD CUP 1999 intrusion detection dataset was introduced at the third international knowledge discovery and data mining tools competition, and it has been widely used in many studies. The attack types of the KDD CUP 1999 dataset are divided into four categories: user to root (U2R), remote to local (R2L), denial of service (DoS), and Probe. We use five classes by adding the normal class. We define the U2R, R2L, and Probe classes, which are each less than 1% of the total dataset, as rare classes. In this study, we attempt to mitigate the class imbalance of the dataset. Using the synthetic minority oversampling technique (SMOTE), we attempted to optimize the SMOTE ratios for the rare classes (U2R, R2L, and Probe). After randomly generating a number of tuples of SMOTE ratios, these tuples were used to create a numerical model for optimizing the SMOTE ratios of the rare classes. Support vector regression was used to create the model. We assigned each instance in the test dataset to the model and chose the best SMOTE ratios. The experiments using machine-learning techniques were conducted using the best ratios.
The results using the proposed method were significantly better than those of the previous approach and other related work.

1. Introduction

The early IDS (intrusion detection system) [1] is divided into the host-based IDS (HIDS) and the network-based IDS (NIDS). HIDS has the advantage of analyzing the system log and resource usage information by host and user. However, installing an IDS on each host increases the number of management points and consumes more resources. If network-level packet analysis is not possible and the attacker takes control of the system, the IDS may be interrupted. NIDS has the advantages that an IDS does not need to be installed on each host and that analysis can be performed at the level of the entire network. However, its disadvantage is that only attacks that pass through the IDS can be confirmed, and it is difficult to confirm attack attempts at the system level.

In early 2003, the IDS was losing the trust of users because it generated false positives. False positives are caused by the development of erroneous rules, traffic irregularities, and limitations of pattern matching tests. Even though the IDS problem has not been solved to date, “pattern matching” is still being used as a basis for security solutions.

Intrusion detection approaches [2] are divided into misuse detection and anomaly detection. In misuse detection, detected attacks are compared with existing signatures in the database to determine whether they are intrusions. While misuse detection detects only known attacks, anomaly detection can detect a new type of attack whose pattern differs from both normal traffic and the known attack types.

Many researchers have studied intrusion detection. In general, researchers have attempted to distinguish the normal class from attack classes using the publicly available intrusion detection evaluation dataset and to identify the exact attack type.
However, the classification of rare classes in a huge real-time dataset requires a long computation time, which makes it difficult to achieve good efficiency. It is necessary to create and test many experimental datasets to improve classification performance by adjusting the class ratio.

Hindawi
Computational Intelligence and Neuroscience
Volume 2018, Article ID 9704672, 11 pages
https://doi.org/10.1155/2018/9704672
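
To make concrete the idea of adjusting the class ratio with SMOTE that the abstract and introduction describe, the following is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that a "SMOTE ratio" means the number of synthetic points generated per original minority point, and the function names, the neighbour count `k`, and the ratio bounds `low` and `high` are hypothetical choices made for illustration. The regression model that the paper fits over the sampled ratio tuples is not shown.

```python
import math
import random


def smote_oversample(minority, ratio, k=5, seed=0):
    """Return synthetic points for `minority`, a list of equal-length float tuples.

    Assumption: `ratio` is the number of synthetic points per original point,
    so ratio=2.0 yields 2 * len(minority) new points.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    n_new = int(round(ratio * len(minority)))
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def random_ratio_tuples(n_tuples, low=1.0, high=10.0, seed=0):
    """Draw candidate (U2R, R2L, Probe) ratio tuples uniformly at random.

    The bounds `low` and `high` are hypothetical, not taken from the paper.
    """
    rng = random.Random(seed)
    return [
        tuple(rng.uniform(low, high) for _ in range(3))
        for _ in range(n_tuples)
    ]
```

Each tuple returned by `random_ratio_tuples` would give one ratio per rare class. Oversampling each rare class with its ratio and then training and scoring a classifier produces one data point for a model that maps ratios to classification performance.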