Toward an efficient and scalable feature selection approach for internet traffic classification

Adil Fahad a,*, Zahir Tari a, Ibrahim Khalil a, Ibrahim Habib b, Hussein Alnuweiri c

a School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
b Department of Electrical Engineering, City University of New York, United States
c Electrical & Computer Engineering Program, Texas A&M University at Qatar, Doha, Qatar

Article info
Article history: Received 20 April 2012; received in revised form 4 March 2013; accepted 7 April 2013; available online xxxx
Keywords: Feature selection; Metrics; Traffic classification

Abstract
There is significant interest in the network management and industrial security community in identifying the “best” and most relevant features of network traffic in order to properly characterize user behaviour and predict future traffic. The ability to eliminate redundant features is an important Machine Learning (ML) task because it helps to identify the best features, which improves classification accuracy and reduces the computational complexity of constructing the classifier. In practice, feature selection (FS) techniques can be used as a preprocessing step to eliminate irrelevant features and as a knowledge discovery tool to reveal the “best” features in many soft computing applications. In this paper, we investigate the advantages and disadvantages of such FS techniques using newly proposed metrics (namely goodness, stability and similarity). We continue our efforts toward developing an integrated FS technique that is built on the key strengths of existing FS techniques. A novel way is proposed to identify the “best” features efficiently and accurately by first combining the results of some well-known FS techniques to find consistent features, and then using the proposed concept of support to select the smallest set of features that covers the data optimally.
The empirical study over ten high-dimensional network traffic data sets demonstrates a significant gain in accuracy and improved run-time performance of a classifier compared to the individual results produced by some well-known FS techniques.

© 2013 Elsevier B.V. All rights reserved.

Computer Networks xxx (2013) xxx–xxx
Contents lists available at SciVerse ScienceDirect
Journal homepage: www.elsevier.com/locate/comnet
1389-1286/$ - see front matter
http://dx.doi.org/10.1016/j.comnet.2013.04.005

* Corresponding author. Address: School of Computer Science and Information Technology, RMIT University, Melbourne, 3001 Victoria, Australia. Tel.: +61 420295188.
E-mail addresses: alharthi.adil@gmail.com (A. Fahad), Zahir.tari@rmit.edu.au (Z. Tari), Ibrahim.khalil@rmit.edu.au (I. Khalil), habib@ccny.cuny.edu (I. Habib), hussein.alnuweiri@qatar.tamu.edu (H. Alnuweiri).

1. Introduction

Network traffic classification has attracted a lot of interest in various areas, including Supervisory Control and Data Acquisition (SCADA) (industrial network) security monitoring, Internet user accounting, Quality of Service, and user behaviour. Classification-based techniques [3,34] rely on a set of “good” features (those that provide better class separability) in order to develop accurate and realistic traffic models. Identifying good features for classification is a challenging task because: (i) it requires expert knowledge of the domain to understand which features are important, (ii) data sets may contain redundant and irrelevant features, which greatly reduce the accuracy of the classification process, and (iii) the efficiency of classifiers (e.g., those based on Machine Learning techniques) is reduced when analysing a large number of features. Indeed, a number of studies (e.g. [5,26]) have shown that irrelevant/redundant features can degrade the predictive accuracy and intelligibility of the classification model, increase its training and testing time, and increase storage requirements. This paper addresses such issues and proposes a new technique