Identifying Sessions to Websites as an Aggregation of Related Flows

Luis Miguel Torres, Eduardo Magaña, Mikel Izal and Daniel Morato
Departamento de Automática y Computación, Universidad Pública de Navarra, Pamplona, Navarra, Spain.
Email: [luismiguel.torres, eduardo.magana, mikel.izal, daniel.morato]@unavarra.es

Abstract: In the field of traffic classification, previous efforts have centered on identifying applications (HTTP, SMTP, FTP, etc.) rather than the actual services that they provide (e-mail, file transfer, video streaming, etc.). Nowadays, however, a single application such as HTTP can provide multiple services to the end user. Network traffic for a web-based service is composed of a characteristic pattern of flows rather than of characteristic individual flows. In this paper we study the traffic of different web-based services (webmail, social networks, video streaming and online newspapers) as summarized by NetFlow-type records. We also propose a method that clusters flows belonging to the same web session using only the data provided by those records. The resulting clusters contain more information about the service whose traffic they comprise than individual flows do, and thus they are easier to assign to the right service.

Index Terms: web and Internet services, web sites, web mining, clustering methods.

I. INTRODUCTION

As of today, a wide variety of services are provided through the Internet (e-mail, video streaming, P2P file sharing, social networks, etc.). Identifying the traffic generated by them is useful in order to gather data about the use of the networks, prioritize important services or block undesired ones. Traditionally, service and application were terms that could be used interchangeably. Because of this, most proposals in the literature deal with application identification, as this was enough to find the services in the traffic. This is no longer true, as some applications now provide multiple services.
For example, web traffic can be related to a myriad of different services, from e-mail to video streaming, as the web browser is increasingly used as the sole interface through which the user accesses the Internet.

Identifying applications has traditionally been possible by simple port number inspection. Nowadays, however, this is not a reliable technique, as some new applications use random (or deliberately misleading) ports in order to avoid firewalls or detection. Signature-based methods and, more recently, behaviour-based ones are now popular. The former yield good results but need constant updates, fail against encryption and raise privacy concerns; the latter are somewhat less reliable and more difficult to tune, but they offer a broader spectrum of detection and their popularity is rising. Our work is oriented towards behaviour-based techniques, as we believe that their advantages outweigh their limitations.

In any case, most proposals in the literature try to classify flows into applications by their individual characteristics. Nevertheless, most applications open multiple related connections. Individually, some of them may not be distinctive enough, but together they can offer an accurate signature of the application's operation. Because of this, the performance of the classification process could be improved if the objective were to classify not individual flows, but groups of flows that belong to the same service.

The objective of this paper is, precisely, to present a method that attempts to group flows that belong to the same web session into clusters that are easier to classify. For the moment, we have decided to limit our study to web traffic, which, besides traditional web browsing, is nowadays associated with a myriad of other services. Since the following sections deal only with web traffic, all flows will be TCP. We define a session as a collection of TCP flows generated by the web browser while the user is accessing a specific service.
For example, a webmail session would span all the connections opened by the web browser from the moment the user opens the login page of the mail service until he or she closes the browser or the tab, or opens a different website in the same tab.

When working with web traffic, it is important to note that recent studies [1] show that its characteristics have changed greatly from the ones described thoroughly in the 1990s [2]. This is partially the result of the introduction of HTTP/1.1 (persistent connections and pipelining have made obsolete the notion that every connection comprises a single request/response pair). However, webpages themselves have also evolved drastically over recent years, which affects the profile of web traffic. Because of dynamic content, web analytics tools, content distribution networks (CDN) and client-side processing, the servers used by a website change over time, connections may stay open and thus occur concurrently with others that belong to a different session, and different websites may open connections to the same servers or have unexpected traffic profiles. All of this makes clustering flows that belong to the same web session far from a trivial task.

Our clustering algorithm uses only the information that we would get from a NetFlow-type record [3]. These flow records provide manageable data and are collected in many links and networks throughout the Internet, which makes a real-

978-1-4673-1391-9/12/$31.00 © 2012 IEEE