978-1-4244-7173-7/10/$26.00 © 2010 IEEE
Using per-Source Measurements to Improve
Performance of Internet Traffic Classification
Stefano Bregni, Senior Member, IEEE, Diego Lucerna, Student Member, IEEE
Cristina Rottondi, Giacomo Verticale, Member, IEEE
Politecnico di Milano, Dept. of Electronics and Information, Piazza Leonardo Da Vinci 32, 20133 Milano, ITALY
Tel.: +39-02-2399.3503 – Fax: +39-02-2399.3413 – E-mail: {bregni, lucerna, vertical}@elet.polimi.it
Abstract ⎯ Obfuscated and encrypted protocols hinder traffic
classification by classical techniques such as port analysis or deep
packet inspection. Therefore, there is growing interest in classification algorithms based on statistical analysis of the lengths of
the first packets of flows. Most classifiers proposed in literature
are based on machine learning techniques and consider each flow
independently of previous source activity (per-flow analysis). In
this paper, we propose to use specific per-source information to
improve classification accuracy: the sequence of starting times of
flows generated by a single source may be analyzed over time to
estimate peculiar statistical parameters, in our case the exponent
α of the power law f^-α that approximates the PSD of their counting
process. In our method, this measurement is used to train a
classifier in addition to the lengths of the first packets of the
flows. In our experiments, considering this additional per-source
information yielded the same accuracy as using only per-flow
data, but observing fewer packets in each flow and thus allowing
a quicker response. For the proposed classifier, we report per-
formance evaluation results obtained on sets of Internet traffic
traces collected in three sites.
Index Terms ⎯ Communication system traffic, Internet, long-
range dependence, traffic measurement (communication).
I. INTRODUCTION
The goal of Internet traffic classification is to associate a
sequence of packets exchanged between two hosts on a transport-port
pair (i.e., a flow) with the source application. The identification
of applications may be useful, for example, for link usage
analysis, for management of Quality of Service (QoS) or for
blocking traffic flows not conforming to local policies.
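As a side illustration of the flow abstraction used above, the following sketch (our own hypothetical code, not part of the classifier described in this paper) derives a direction-independent flow key from a packet's transport 5-tuple, so that packets travelling in either direction of the same connection map to the same flow:

```python
def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Build a flow key that is identical for both directions
    of the same connection (illustrative, hypothetical helper)."""
    a = (src_ip, src_port)
    b = (dst_ip, dst_port)
    # Sort the two endpoints so the key is direction-independent.
    return (proto,) + (a + b if a <= b else b + a)

# Both directions of one TCP connection share a single key.
k1 = flow_key("10.0.0.1", 443, "10.0.0.2", 51000, "tcp")
k2 = flow_key("10.0.0.2", 51000, "10.0.0.1", 443, "tcp")
assert k1 == k2
```

Sorting the endpoint pairs, rather than keeping them in arrival order, is what makes the two directions of a connection collapse into one flow.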
Common techniques used for Internet traffic classification
are based on the packet payload inspection or on well-known
transport protocol port numbers. However, newer Internet ap-
plications and protocols may use random or non-standard port
numbers, or employ packet encryption or traffic obfuscation,
making these techniques ineffective. Therefore, recent studies
consider using statistical analysis to assist traffic identification
and classification by way of machine-learning techniques.
A comprehensive survey of machine learning (ML) traffic
classification techniques is provided in [1] and [2]. Several
papers studied the performance of various classification algo-
rithms, based on statistical analysis of single flows.
The Nearest Neighbours (NN), Linear Discriminant Analysis (LDA),
and Quadratic Discriminant Analysis (QDA) algorithms were proposed
in [3] to identify the QoS class of different applications. The
authors identified a list of possible
features, calculated over the entire flow duration, and claimed
to obtain a classification error ranging from 2.5% to 12.6%.
The application of a Bayesian neural network was proposed
in [4]. The classification accuracy reached 99% when the
training and the test data were collected on the same day and
95% when the test data were collected eight months after the
training data. A Protocol Fingerprinting technique was pro-
posed in [5], based on observing packet lengths, inter-arrival
times, and packet arrival order. By classifying three applications
(HTTP, SMTP, POP3), an accuracy greater than 91% was obtained by
observing as few as 4 or 5 packets.
A decision tree algorithm to classify Internet traffic was
proposed in [6]. The traffic features considered are the lengths
of the first 5 packets in both directions and their inter-arrival
times. Accuracy between 92% and 99% was achieved.
Traffic classification can be done offline or in real time.
When done in real time, classification is required as early as
possible, i.e., by observing as few packets as possible, so as to
minimize the delay before the classification result is available. On the
other hand, the classification accuracy improves with the
number of observed packets. Thus, a trade-off between classi-
fication delay and accuracy must be sought.
Our work is motivated by the consideration that different
flows, originated by the same host, are likely to be started by
the same application or by a limited number of applications
running on the same host. A vast literature on traffic measure-
ment reports that traffic statistical characteristics depend on
the generating application [7][8]. Therefore, observing the
source activity over time can reveal some peculiar behaviour.
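As a minimal illustration of this idea (our own sketch with hypothetical feature values, not the classifier evaluated in this paper), each flow can be described by the lengths of its first packets extended with a per-source statistic, and any standard ML rule can then be trained on the resulting vectors; here, a plain 1-nearest-neighbour rule:

```python
import math

def classify_1nn(train, flow_features, source_alpha):
    """1-NN over per-flow features (first-packet lengths) augmented
    with a per-source statistic (hypothetical illustrative values)."""
    x = list(flow_features) + [source_alpha]
    best_label, best_dist = None, float("inf")
    for features, label in train:
        d = math.dist(features, x)  # Euclidean distance
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

# Hypothetical training set: first-3-packet lengths + alpha, class.
train = [
    ([1500, 1500, 1500, 0.9], "p2p"),
    ([ 400,  300,  200, 0.2], "web"),
]
print(classify_1nn(train, [1400, 1450, 1500], 0.8))  # → p2p
```

In this toy example the per-source exponent simply becomes one more coordinate of the feature vector, which is how any per-flow classifier can absorb per-source information without structural changes.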
Moreover, it has been shown that traffic data series often
exhibit Long-Range Dependence (LRD), i.e. an asymptotic
power-law decrease of their Power Spectral Density (PSD) as
~f^-α (for f → 0) or, equivalently, of their autocovariance.
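For illustration, such an exponent can be estimated from the slope of the log-log periodogram at low frequencies. The sketch below (our own simplified estimator, not necessarily the procedure adopted in this paper) bins a sequence of flow start times into a counting process and fits log S(f) ≈ const − α log f by least squares:

```python
import numpy as np

def estimate_alpha(start_times, bin_width=1.0, low_freq_fraction=0.1):
    """Estimate the exponent alpha of a ~f^-alpha PSD from flow start
    times via a low-frequency periodogram fit (illustrative sketch)."""
    t = np.asarray(start_times, dtype=float)
    n_bins = max(8, int(np.ceil((t.max() - t.min()) / bin_width)))
    counts, _ = np.histogram(t, bins=n_bins)    # counting process per bin
    x = counts - counts.mean()                  # remove the DC component
    psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)  # periodogram estimate
    freqs = np.fft.rfftfreq(len(x), d=bin_width)
    # Fit only the lowest non-zero frequencies, where the power law holds.
    k = max(3, int(low_freq_fraction * len(freqs)))
    slope, _ = np.polyfit(np.log(freqs[1:k]), np.log(psd[1:k]), 1)
    return -slope  # PSD ~ f^-alpha  =>  alpha = -slope
```

For a memoryless (Poisson) arrival process the estimate is close to α ≈ 0, since its PSD is flat, while long-range-dependent sources yield α > 0.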
Among early works, power-law PSD was identified in LAN
packet traffic in [7]. The authors concluded that it was caused by
the nature of the data-transfer applications.
Power-law PSD at packet level was identified also in WAN
traffic [9]. The authors also investigated connection-level traffic,
concluding that Telnet and FTP control connections were well
modelled as Poisson processes, while FTP
data connections, NNTP, and SMTP were not.
Web-browsing traffic was studied in [10], by measuring the
sequence of file requests performed during each session (i.e.,
one execution of the web-browsing application), finding that the
power law originates from the long-tailed distributions of the
requested file sizes and of the users' "think times".