Toward an efficient and scalable feature selection approach for internet traffic classification

Adil Fahad a,*, Zahir Tari a, Ibrahim Khalil a, Ibrahim Habib b, Hussein Alnuweiri c

a School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
b Department of Electrical Engineering, City University of New York, United States
c Electrical & Computer Engineering Program, Texas A&M University at Qatar, Doha, Qatar

Article info
Article history: Received 20 April 2012; received in revised form 4 March 2013; accepted 7 April 2013; available online xxxx
Keywords: Feature selection; Metrics; Traffic classification

Abstract
There is significant interest in the network management and industrial security communities in identifying the "best" and most relevant features of network traffic in order to properly characterize user behaviour and predict future traffic. Eliminating redundant features is an important Machine Learning (ML) task because it helps to identify the best features, which improves classification accuracy and reduces the computational complexity of constructing the classifier. In practice, feature selection (FS) techniques can be used as a preprocessing step to eliminate irrelevant features and as a knowledge discovery tool to reveal the "best" features in many soft computing applications. In this paper, we investigate the advantages and disadvantages of such FS techniques using newly proposed metrics (namely goodness, stability and similarity). We continue our efforts toward developing an integrated FS technique that is built on the key strengths of existing FS techniques. We propose a novel way to identify the "best" features efficiently and accurately by first combining the results of some well-known FS techniques to find consistent features, and then using the proposed concept of support to select the smallest set of features that covers the data optimally.
The empirical study over ten high-dimensional network traffic data sets demonstrates a significant gain in accuracy and improved run-time performance of a classifier compared to the individual results produced by some well-known FS techniques.
© 2013 Elsevier B.V. All rights reserved.

Computer Networks xxx (2013) xxx–xxx
Journal homepage: www.elsevier.com/locate/comnet
1389-1286/$ - see front matter
http://dx.doi.org/10.1016/j.comnet.2013.04.005

* Corresponding author. Address: School of Computer Science and Information Technology, RMIT University, Melbourne, 3001 Victoria, Australia. Tel.: +61 420295188.
E-mail addresses: alharthi.adil@gmail.com (A. Fahad), Zahir.tari@rmit.edu.au (Z. Tari), Ibrahim.khalil@rmit.edu.au (I. Khalil), habib@ccny.cuny.edu (I. Habib), hussein.alnuweiri@qatar.tamu.edu (H. Alnuweiri).

1. Introduction

Network traffic classification has attracted considerable interest in various areas, including Supervisory Control and Data Acquisition (SCADA) (industrial network) security monitoring, Internet user accounting, Quality of Service, and user behaviour. Classification-based techniques [3,34] rely on a set of "good" features (those that provide better class separability) in order to develop accurate and realistic traffic models. Identifying good features for classification is a challenging task because: (i) it requires expert knowledge of the domain to understand which features are important, (ii) data sets may contain redundant and irrelevant features, which greatly reduce the accuracy of the classification process, and (iii) the efficiency of classifiers (e.g., those based on Machine Learning techniques) is reduced when analysing a large number of features. Indeed, a number of studies (e.g. [5,26]) have shown that irrelevant/redundant features can degrade the predictive accuracy and intelligibility of the classification model, increase its training and testing time, and increase storage requirements. This paper addresses such issues and proposes a new technique