Research Article

Machine-Learning Approach to Optimize SMOTE Ratio in Class Imbalance Dataset for Intrusion Detection

Jae-Hyun Seo 1 and Yong-Hyuk Kim 2

1 Department of Computer Science and Engineering, Wonkwang University, 460 Iksandae-ro, Iksan-si, Jeonbuk 54649, Republic of Korea
2 School of Software, Kwangwoon University, 20 Kwangwoon-ro, Nowon-gu, Seoul 01897, Republic of Korea

Correspondence should be addressed to Yong-Hyuk Kim; yhdfly@kw.ac.kr

Received 30 April 2018; Revised 6 August 2018; Accepted 2 October 2018; Published 1 November 2018

Academic Editor: Giosuè Lo Bosco

Copyright © 2018 Jae-Hyun Seo and Yong-Hyuk Kim. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The KDD CUP 1999 intrusion detection dataset was introduced at the third international knowledge discovery and data mining tools competition, and it has been widely used for many studies. The attack types of KDD CUP 1999 dataset are divided into four categories: user to root (U2R), remote to local (R2L), denial of service (DoS), and Probe. We use five classes by adding the normal class. We define the U2R, R2L, and Probe classes, which are each less than 1% of the total dataset, as rare classes. In this study, we attempt to mitigate the class imbalance of the dataset. Using the synthetic minority oversampling technique (SMOTE), we attempted to optimize the SMOTE ratios for the rare classes (U2R, R2L, and Probe). After randomly generating a number of tuples of SMOTE ratios, these tuples were used to create a numerical model for optimizing the SMOTE ratios of the rare classes. The support vector regression was used to create the model. We assigned each instance in the test dataset to the model and chose the best SMOTE ratios. The experiments using machine-learning techniques were conducted using the best ratios.
The results obtained with the proposed method were significantly better than those of the previous approach and other related work.

1. Introduction

The early IDS (intrusion detection system) [1] is divided into the host-based IDS (HIDS) and the network-based IDS (NIDS). HIDS has the advantage of analyzing the system log and resource usage information by host and user. However, installing an IDS in each host increases the number of management points and consumes more resources. If network-level packet analysis is not possible and the attacker takes control of the system, the IDS may be interrupted. NIDS has the advantage that no IDS needs to be installed on each host, and it can perform analysis at the level of the entire network. However, it has the disadvantage that only the attack can be confirmed via the IDS, and it is difficult to confirm the attack attempt at the system level.

In early 2003, the IDS was losing the trust of users because of the false positives it generated. False positives are caused by erroneous rules, traffic irregularities, and the limitations of pattern matching tests. Even though the IDS problem has not been solved to date, “pattern matching” is still being used as a basis for security solutions.

Intrusion detection [2] is divided into misuse detection and anomaly detection. In misuse detection, detected attacks are compared with existing signatures in the database to determine whether they are intrusions. While misuse detection detects only known attacks, anomaly detection detects new types of attack whose pattern differs from both normal traffic and the known attack types.

Many researchers have studied intrusion detection. In general, researchers attempted to distinguish the normal class from attack classes using the publicly available intrusion detection evaluation dataset and to identify the exact attack type.
However, the classification of rare classes in a huge real-time dataset requires a long computation time, which makes it difficult to achieve good efficiency. It is necessary to create and test many experimental datasets to improve classification performance by adjusting the class ratio.

Hindawi, Computational Intelligence and Neuroscience, Volume 2018, Article ID 9704672, 11 pages, https://doi.org/10.1155/2018/9704672
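
As a rough illustration of adjusting the class ratio for the rare classes, the following Python sketch draws random tuples of SMOTE ratios and oversamples one rare class by SMOTE-style interpolation between nearest neighbours. It is not the authors' implementation: the function names `smote` and `random_ratio_tuples`, the reading of a ratio as a multiple of the original class size, the ratio range, and the choice of k = 5 are all assumptions made for illustration, and the support vector regression step is omitted.

```python
import math
import random

# Rare classes named in the article; the order of each ratio tuple follows this list.
RARE_CLASSES = ("U2R", "R2L", "Probe")


def smote(minority, ratio, k=5, rng=None):
    """Create synthetic samples for one rare class (SMOTE-style interpolation).

    minority: list of numeric feature vectors belonging to the rare class.
    ratio:    assumed meaning: number of synthetic samples as a multiple of
              len(minority), e.g. ratio=2.0 yields 2 * len(minority) samples.
    k:        number of nearest neighbours considered for interpolation.
    """
    rng = rng or random.Random()
    if len(minority) < 2:
        return []  # interpolation needs at least two samples
    k = min(k, len(minority) - 1)
    n_new = int(round(ratio * len(minority)))
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest neighbours of the base sample (Euclidean).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position on the segment between base and neighbour
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def random_ratio_tuples(n, low=0.0, high=10.0, rng=None):
    """Draw n random (U2R, R2L, Probe) ratio tuples; the range is hypothetical."""
    rng = rng or random.Random()
    return [tuple(rng.uniform(low, high) for _ in RARE_CLASSES) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(42)
    toy_u2r = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy data only
    for ratios in random_ratio_tuples(3, rng=rng):
        extra = smote(toy_u2r, ratios[0], k=2, rng=rng)
        print(ratios, "->", len(extra), "synthetic U2R samples")
```

Each ratio tuple yields one candidate training set. In the approach described in the abstract, the classification results obtained for such tuples would then be used to fit a regression model from which the best ratios are chosen.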