Journal of Information Security Research, Volume 3, Number 4, December 2012
Ikram Chaïri, Souad Alaoui, Abdelouahid Lyhyaoui
LTI Lab, Department of Electrical and Industrial
ENSA of Tangier
Abdelmalek Essaâdi University
BP: 1818, Tanger Principal, Tanger
Morocco
{Chairikram, lyhyaoui}@gmail.com, souad_a2002@yahoo.fr
ABSTRACT: Most learning systems assume that training sets are balanced; in real-world data, however,
this assumption does not always hold. The problem of between-class imbalance has attracted growing
attention from both academia and industry because of its critical influence on the performance of learning systems. Many
solutions have been proposed to resolve this problem. The common practice for dealing with imbalanced data sets is to
rebalance them artificially by using sampling methods. Unfortunately, these methods do not deliver high learning
performance. In this paper, we propose a new method based on Sample Selection (SS) to deal with the problem of between-class
imbalance. We consider that balancing the classes while retaining the examples located near the border between them
improves the performance of the classifier. To reduce the computational cost of examining all samples, we propose a clustering
method as a first step to determine the critical centers, and then select samples from those critical clusters. Experimental
results with a Multi-Layer Perceptron (MLP) architecture on a well-known intrusion detection data set show that our approach
attains the precision of Boosting methods, which, as we will explain, can themselves be regarded as SS methods.
Keywords: Imbalanced Data, Intrusion Detection System, Nearest Opposing Pairs, Sample Selection, Boosting, Classification
Received: 12 August 2012, Revised 29 September 2012, Accepted 5 October 2012
© 2012 DLINE. All rights reserved
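As a rough illustration of the cluster-then-select idea summarized in the abstract, the following Python sketch clusters the majority class, ranks the clusters by their distance to the minority class, and keeps only the samples of the closest ("critical") clusters. It is a minimal sketch under stated assumptions, not the authors' algorithm: the use of k-means, the farthest-point seeding, the distance criterion, the toy data, and all function names (pairwise_dist, farthest_point_init, kmeans, select_border_samples) are hypothetical choices made for this example.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distance between every row of A and every row of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def farthest_point_init(X, k):
    """Deterministic seeding (an assumption of this sketch): start at X[0],
    then repeatedly add the point farthest from the centers chosen so far."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = pairwise_dist(X, np.array(centers)).min(axis=1)
        centers.append(X[d.argmax()])
    return np.array(centers, dtype=float)

def kmeans(X, k, n_iter=20):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    centers = farthest_point_init(X, k)
    for _ in range(n_iter):
        labels = pairwise_dist(X, centers).argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    labels = pairwise_dist(X, centers).argmin(axis=1)
    return centers, labels

def select_border_samples(X_maj, X_min, k=4, n_critical=1):
    """Keep majority samples that belong to the n_critical clusters whose
    centers lie closest to the minority class (hypothetical criterion)."""
    centers, labels = kmeans(X_maj, k)
    # Distance from each majority center to its nearest minority sample.
    gap = pairwise_dist(centers, X_min).min(axis=1)
    # Empty clusters cannot contribute samples, so never rank them first.
    gap[np.bincount(labels, minlength=k) == 0] = np.inf
    critical = np.argsort(gap)[:n_critical]
    return X_maj[np.isin(labels, critical)]

# Toy data: four tight majority groups along the x-axis, minority near x = -3.
offsets = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]])
X_maj = np.vstack([offsets + [cx, 0.0] for cx in (0.0, 10.0, 20.0, 30.0)])
X_min = np.array([[-3.0, 0.0], [-3.0, 0.2]])

kept = select_border_samples(X_maj, X_min, k=4, n_critical=1)
print(len(kept), "of", len(X_maj), "majority samples kept")
```

In this toy setting only the majority group adjacent to the minority class is retained, which reduces the imbalance while keeping the samples closest to the border. Ranking cluster centers instead of individual samples is what keeps the selection cheap, since the distance computation involves k centers rather than every majority sample.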
1. Introduction
The growth of raw data caused by the development of sciences and technologies has created an immense opportunity to
improve data engineering. The problem of imbalanced data emerged as more and more researchers realized that their data sets
were imbalanced and that this imbalance caused suboptimal classification performance.
In recent years, the imbalanced learning problem has generated a significant amount of interest from academia, industry, and
government funding agencies. The main problem was, and still is, to find a classifier that can learn from imbalanced data
without ignoring the minority class. To deal with this class imbalance problem, many solutions have been proposed, such as
sampling methods, cost-function methods, kernel-based methods, and active learning methods [1], [2], [3]. Sampling methods remain
the most widely used approach to the class imbalance problem [4]. Besides the imbalance of the data itself, the
approximation of the misclassification error used in the learning system can also reduce the
accuracy and the quality of learning. That is why a different method of sample selection has been proposed to deal with this
Balancing Distribution of Intrusion Detection Data Using Sample Selection