Identifying Sessions to Websites as an Aggregation of Related Flows

Luis Miguel Torres, Eduardo Magaña, Mikel Izal and Daniel Morato
Departamento de Automática y Computación, Universidad Pública de Navarra, Pamplona, Navarra, Spain.
Email: [luismiguel.torres, eduardo.magana, mikel.izal, daniel.morato]@unavarra.es

Abstract—In the field of traffic classification, previous efforts have centered on identifying applications (HTTP, SMTP, FTP, etc.) rather than the actual services that they provide (e-mail, file transfer, video streaming, etc.). Nowadays, however, a single application such as HTTP can provide multiple services to the end user. Network traffic for a web-based service is composed of a characteristic pattern of flows rather than characteristic individual flows. In this paper we study the traffic of different web-based services (webmail, social networks, video streaming and online newspapers) as summarized by NetFlow-type records. We also propose a method that clusters flows belonging to the same web session using only the data provided by those records. The resulting clusters contain more information about the service whose traffic they comprise, and are therefore easier to assign to the right service.

Index Terms—web and Internet services, web sites, web mining, clustering methods.

I. INTRODUCTION

As of today, a wide variety of services are provided through the Internet (e-mail, video streaming, P2P file sharing, social networks, etc.). Identifying the traffic they generate is useful for gathering data about network usage, prioritizing important services or blocking undesired ones.

Traditionally, service and application were terms that could be used interchangeably. Because of this, most proposals in the literature deal with application identification, as this was enough to find the services in the traffic. This is no longer true, as some applications now provide multiple services.
For example, web traffic can be related to a myriad of different services, from e-mail to video streaming, as the web browser is increasingly used as the sole interface through which the user accesses the Internet.

Traditionally, applications could be identified by simple port number inspection. Nowadays, however, this is not a reliable technique, as some new applications use random (or deliberately misleading) ports in order to avoid firewalls or detection. Signature-based methods and, more recently, behaviour-based ones are now popular. The former yield good results but need constant updates, fail against encryption and raise privacy concerns; the latter are somewhat less reliable and difficult to tune, but offer a broader spectrum of detection, and their popularity is rising. Our work is orientated towards behaviour-based techniques, as we believe that their advantages outweigh their limitations.

In any case, most proposals in the literature try to classify flows into applications by their individual characteristics. Nevertheless, most applications open multiple related connections. Individually, some of them may not be distinctive enough, but together they can offer an accurate signature of the application's operation. Because of this, the performance of the classification process could be improved if the objective were to classify not individual flows, but groups of flows that belong to the same service.

The objective of this paper is, precisely, to present a method that attempts to group flows that belong to the same web session into clusters that are easier to classify. For the moment, we have decided to limit our study to web traffic which, besides traditional web browsing, is nowadays associated with a myriad of other services. Since the following sections refer only to web traffic, all flows will be TCP.

We define a session as a collection of TCP flows generated by the web browser while the user is accessing a specific service.
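The session definition above can be pictured as a simple data model. The following is a minimal sketch, not the paper's implementation: the class names `FlowRecord` and `Session`, the field names and the sample addresses are all assumptions chosen for illustration, modelled loosely on the kind of fields a NetFlow-type record usually carries (endpoints, ports, timestamps, packet and byte counts).

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical NetFlow-type summary of one TCP flow (illustrative fields)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    start: float   # time of the first packet, in seconds
    end: float     # time of the last packet, in seconds
    packets: int
    octets: int    # total bytes carried by the flow


@dataclass
class Session:
    """All TCP flows the browser generated while the user accessed one service."""
    service: str
    flows: List[FlowRecord] = field(default_factory=list)

    def span(self) -> float:
        """Seconds between the earliest flow start and the latest flow end."""
        return max(f.end for f in self.flows) - min(f.start for f in self.flows)

    def servers(self) -> Set[str]:
        """Distinct server addresses contacted during the session."""
        return {f.dst_ip for f in self.flows}

    def total_octets(self) -> int:
        """Bytes transferred across every flow in the session."""
        return sum(f.octets for f in self.flows)


# Made-up webmail session: three flows to two servers (documentation addresses).
webmail = Session(
    service="webmail",
    flows=[
        FlowRecord("192.0.2.10", "198.51.100.5", 50001, 443, 0.0, 12.5, 40, 30000),
        FlowRecord("192.0.2.10", "198.51.100.5", 50002, 443, 1.0, 40.0, 120, 150000),
        FlowRecord("192.0.2.10", "203.0.113.7", 50003, 80, 2.0, 3.5, 10, 4000),
    ],
)
```

In this picture, classifying the session means looking at aggregate properties such as `webmail.span()`, `webmail.servers()` or `webmail.total_octets()` rather than at any single flow, which is the intuition behind grouping flows before classification.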
For example, a webmail session would span all the connections opened by the web browser from the moment the user opens the login webpage of the mail service until he or she closes the browser or the tab, or opens a different website in the same tab.

When working with web traffic, it is important to note that recent studies [1] show that its characteristics have changed greatly from those described thoroughly in the 1990s [2]. This is partially the result of the introduction of HTTP 1.1 (persistent connections and pipelining have made the notion that every connection comprises a single request/response pair obsolete). However, webpages themselves have also evolved drastically in recent years, affecting the profile of web traffic. Because of dynamic content, web analytics tools, content distribution networks (CDN) and client-side processing, the server used by a website will change over time, connections may stay open and thus occur concurrently with others that belong to a different session, and different websites may open connections to the same servers or have unexpected traffic profiles. All of this makes clustering flows that belong to the same web session far from a trivial task.

Our clustering algorithm uses only the information that we would get from a NetFlow-type record [3]. These flow records provide manageable data and are collected in many links and networks throughout the Internet, which makes a real-
978-1-4673-1391-9/12/$31.00 © 2012 IEEE