978-1-4244-7173-7/10/$26.00 © 2010 IEEE

Using per-Source Measurements to Improve Performance of Internet Traffic Classification

Stefano Bregni, Senior Member, IEEE, Diego Lucerna, Student Member, IEEE, Cristina Rottondi, Giacomo Verticale, Member, IEEE
Politecnico di Milano, Dept. of Electronics and Information, Piazza Leonardo Da Vinci 32, 20133 Milano, ITALY
Tel.: +39-02-2399.3503 – Fax: +39-02-2399.3413 – E-mail: {bregni, lucerna, vertical}@elet.polimi.it

Abstract—Obfuscated and encrypted protocols hinder traffic classification by classical techniques such as port analysis or deep packet inspection. Therefore, there is growing interest in classification algorithms based on statistical analysis of the lengths of the first packets of flows. Most classifiers proposed in the literature are based on machine-learning techniques and consider each flow independently of previous source activity (per-flow analysis). In this paper, we propose using specific per-source information to improve classification accuracy: the sequence of starting times of the flows generated by a single source may be analyzed along time to estimate peculiar statistical parameters, in our case the exponent α of the power law f^−α that approximates the PSD of their counting process. In our method, this measurement is used to train a classifier in addition to the lengths of the first packets of the flows. In our experiments, considering this additional per-source information yielded the same accuracy as using only per-flow data, while observing fewer packets in each flow and thus allowing a quicker response. For the proposed classifier, we report performance evaluation results obtained on sets of Internet traffic traces collected at three sites.

Index Terms—Communication system traffic, Internet, long-range dependence, traffic measurement (communication).

I. INTRODUCTION

The goal of Internet traffic classification is to associate a sequence of packets between two hosts on a transport port pair (i.e., a flow) with the source application. Identifying applications is useful, for example, for link-usage analysis, for management of Quality of Service (QoS), or for blocking traffic flows that do not conform to local policies.

Common techniques for Internet traffic classification are based on packet payload inspection or on well-known transport-protocol port numbers. However, newer Internet applications and protocols may use random or non-standard port numbers, or employ packet encryption or traffic obfuscation, making these techniques ineffective. Therefore, recent studies consider using statistical analysis to assist traffic identification and classification by means of machine-learning techniques.

A comprehensive survey of machine-learning (ML) traffic classification techniques is provided in [1] and [2]. Several papers have studied the performance of various classification algorithms based on statistical analysis of single flows.

The Nearest Neighbours (NN), Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) algorithms were proposed in [3] to identify the QoS class of different applications. The authors identified a list of possible features, calculated over the entire flow duration, and claimed a classification error ranging from 2.5% to 12.6%. The application of a Bayesian neural network was proposed in [4]. The classification accuracy reached 99% when the training and the test data were collected on the same day, and 95% when the test data were collected eight months after the training data. A Protocol Fingerprinting technique was proposed in [5], based on observing packet lengths, inter-arrival times, and packet arrival order.
By classifying three applications (HTTP, SMTP, POP3), an accuracy greater than 91% was obtained by observing as few as 4 or 5 packets. A decision-tree algorithm to classify Internet traffic was proposed in [6]. The traffic features considered are the lengths of the first 5 packets in both directions and their inter-arrival times. Accuracy between 92% and 99% was achieved.

Traffic classification can be done offline or in real time. When in real time, classification is required as early as possible, i.e., by looking at as few packets as possible, in order to minimize the delay before the classification result is available. On the other hand, classification accuracy improves with the number of observed packets. Thus, a trade-off between classification delay and accuracy must be sought.

Our work is motivated by the observation that different flows originated by the same host are likely to be started by the same application, or by a limited number of applications running on that host. A vast literature on traffic measurement reports that traffic statistical characteristics depend on the generating application [7][8]. Therefore, observing the source activity along time can reveal some peculiar behaviour. Moreover, it has been shown that traffic data series often exhibit Long-Range Dependence (LRD), i.e., an asymptotic power-law decrease of their Power Spectral Density (PSD) as ~f^−α (for f → 0) or, equivalently, of their autocovariance.

Among early works, power-law PSD was identified in LAN packet traffic in [7]. The authors concluded that it was caused by the nature of the data-transfer applications. Power-law PSD at the packet level was identified also in WAN traffic [9]. The authors also conducted some investigation at the connection level, concluding that Telnet and FTP control connections were well modelled as Poisson processes, while FTP data connections, NNTP, and SMTP were not.
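A power-law exponent α of the kind discussed above can be estimated from the low-frequency slope of the periodogram of a traffic series on log-log axes. The following is a minimal sketch of this standard technique, not the estimator used in any of the cited works; the function name, the fit range, and the synthetic demo series are illustrative assumptions.

```python
import numpy as np

def estimate_alpha(series, low_frac=0.01, high_frac=0.1):
    """Estimate alpha assuming PSD ~ f^-alpha for f -> 0, by a
    log-log linear fit over the low-frequency periodogram ordinates.
    The fit range (fractions of the series length) is illustrative."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                           # remove the DC component
    n = len(x)
    psd = np.abs(np.fft.rfft(x)) ** 2 / n      # periodogram estimate of the PSD
    freqs = np.fft.rfftfreq(n)                 # normalized frequencies in [0, 0.5]
    lo, hi = max(1, int(low_frac * n)), int(high_frac * n)
    slope, _ = np.polyfit(np.log(freqs[lo:hi]), np.log(psd[lo:hi]), 1)
    return -slope                              # PSD ~ f^-alpha implies slope = -alpha

# Demo: synthesize a series with PSD ~ f^-1 by spectral shaping of white noise
rng = np.random.default_rng(1)
n = 1 << 14
spec = np.fft.rfft(rng.standard_normal(n))
f = np.fft.rfftfreq(n)
f[0] = 1.0                                     # avoid division by zero at DC
series = np.fft.irfft(spec * f ** -0.5, n)
print(estimate_alpha(series))                  # close to 1 for this synthetic series
```

A short-range-dependent (e.g., Poisson-like) counting process would instead yield an estimate near α = 0, which is what makes the exponent usable as a discriminating per-source feature.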
Web-browsing traffic was studied in [10] by measuring the sequence of file requests performed during each session (i.e., one execution of the web-browsing application), finding that the power law stems from the long-tailed distributions of the requested files and of the users' "think times".
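To make the paper's idea concrete, the per-flow features (lengths of the first few packets) and the per-source α estimate can simply be concatenated into one feature vector and fed to any of the classifiers cited above. The sketch below uses a 1-nearest-neighbour rule; it is not the implementation evaluated in this paper or in the cited works, and all feature values, labels, and class names are invented for illustration.

```python
import numpy as np

def nn_classify(train_X, train_y, x):
    """Classify feature vector x with the label of its Euclidean
    1-nearest neighbour in the training set."""
    d = np.linalg.norm(train_X - x, axis=1)
    return train_y[int(np.argmin(d))]

# Hypothetical training flows: lengths of the first 3 packets, plus the
# per-source alpha estimate as a fourth feature. In practice features
# should be normalized so alpha is not dwarfed by the packet lengths.
train_X = np.array([
    [512, 1460, 1460, 0.8],   # bulk transfer from an LRD source
    [ 64,   90,  120, 0.1],   # interactive traffic, near-Poisson source
    [400, 1200, 1460, 0.7],
    [ 70,  100,  110, 0.2],
], dtype=float)
train_y = np.array(["bulk", "interactive", "bulk", "interactive"])

print(nn_classify(train_X, train_y, np.array([450, 1300, 1400, 0.75])))
# → bulk
```

The per-source α is computed once per host from its flow-arrival counting process and then reused for every flow from that host, which is why fewer per-flow packets suffice for the same accuracy.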