Flow-Based Dissection of Network Services

Dominik Schatzmann, ETH Zurich, schatzmann@tik.ee.ethz.ch
Wolfgang Mühlbauer, ETH Zurich, muehlbauer@tik.ee.ethz.ch
Bernhard Tellenbach, ETH Zurich, tellenbach@tik.ee.ethz.ch
Simon Leinen, Switch, simon.leinen@switch.ch
Kavé Salamatian, Université de Savoie, kave.salamatian@univ-savoie.fr

ABSTRACT

The unprecedented success story of the Internet is largely due to rich and constantly emerging applications such as online social networks and video streaming. To characterize the Internet and its usage, high-level metrics such as traffic volume or topology-related measures have been widely used in the past. However, researchers and network professionals still lack concepts to capture Internet services. Our approach for monitoring applications and services is to be agnostic. Starting at the granularity of flow-level information, we propose a system that can efficiently detect communication end points that offer service to multiple clients. In this technical report, we describe in detail the data structures and data processing techniques that are needed to achieve our goals.

1. INTRODUCTION

Most likely, no other system is used in ways as diverse as the Internet. The history of the Internet during the past 40 years is characterized by the emergence of new and fast-growing applications and services. Online social networks [10], video streaming [30], and AJAX-based text processing, spreadsheet, or webmail applications [29] are just some examples. This rapidly changing environment makes it very challenging for network professionals (operators, engineers, application designers) to anticipate how the network and its resources will be used. Continuous observations and measurements are therefore required to monitor the network.
The research community still faces the challenge of relating observable, high-level metrics such as traffic volume [5, 8, 12] or topology-related measures (e.g., [17, 23, 24, 28]) to higher-level concepts like applications and services. Whereas observable metrics such as packet rates can be defined through objective means, the concepts of application and service are more difficult to capture and are unfortunately ill-defined. To illustrate this difficulty, consider an HTTP flow that transfers a video stream from a video conference. Should this flow be categorized as an HTTP flow, as video streaming, or as a video conference? Similar issues exist for the definition of a service.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission. TIK Report, ETH Zurich, 2011.

Being aware of these issues, we adopt a different approach in this paper. In the spirit of Estan et al. [9], we start from the granularity of flow-level information as observed at the border of a large ISP with more than 2 million internal hosts. Our approach for monitoring applications and services is to make minimal assumptions and let the data speak for itself. For this reason, and to avoid the overloaded terms “application” and “service”, we define the notion of a server socket: a server socket is identified by a tuple of IP address, port number, and protocol number, and acts as a concentrator, i.e., it communicates with multiple sockets. Likewise, we refer to a host on which at least one server socket resides as a server host. Key to our detection of server sockets is a greedy approach.
We first identify server sockets that communicate with a high number of communication end points, and then turn to the remaining sockets. However, we do not restrict ourselves to the top server sockets only. Importantly, our system also identifies server sockets that run on high ports and that are potentially not involved in a lot of traffic, and hence are frequently overlooked. Finally, our system remembers already detected server sockets to cope with the high data rates of 14K to 40K NetFlow records per second.

Our contributions are threefold. Previous measurement studies [6, 12] have mainly reported on network applications or services in the form of aggregate data, e.g., “what amount of traffic is P2P?”. In contrast, our objective is to identify and analyze the communication end points that offer service to a number of clients, and not to study the exchanged traffic per se. In particular, we can detect services that are frequently overlooked (low-traffic applications on high ports) but that, as a whole, turn out to be more relevant than previously thought. In total, our approach provides a detailed view of server sockets and of their locations in terms of hosts, subnets, Autonomous Systems (ASes), or even countries. This can help network operators to anticipate how the network and its resources will be used [1, 2, 19].

To sum up, we have implemented a system that can process large-scale, flow-level data sets for the purpose of detecting server sockets. Our agnostic approach takes into account both TCP and UDP services, does not require TCP/SYN flags as used by Bartlett et al. [4], and copes with noise from scanning [3] and with the low resolution of NetFlow timestamps. Our proposed techniques allow us to process our 5-day trace from 2010 in less than 4 days, and our 5-day trace from 2003 in one day, suggesting that processing in real time is feasible.
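
To make the notion of a server socket and the greedy detection idea more concrete, the following Python fragment gives a minimal sketch. It is an illustration under stated assumptions and not the system described in this report: the names `Socket`, `Flow`, and `detect_server_sockets`, the `min_peers` threshold, and the in-memory set of flows are all hypothetical, and the sketch ignores scanning noise and the low resolution of NetFlow timestamps.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Optional, Set


class Socket(NamedTuple):
    ip: str
    port: int
    proto: int  # IP protocol number, e.g. 6 = TCP, 17 = UDP


class Flow(NamedTuple):
    a: Socket  # one end point of the flow
    b: Socket  # the other end point


def detect_server_sockets(
    flows: Iterable[Flow],
    min_peers: int = 2,                   # assumed threshold, not from the report
    known: Optional[Set[Socket]] = None,  # server sockets remembered from earlier runs
) -> Set[Socket]:
    """Greedily pick the socket with the most distinct peers, then repeat
    on the flows that are not yet explained by a detected server socket."""
    servers: Set[Socket] = set(known or ())
    # Flows that touch an already known server socket need no further work.
    remaining = {f for f in flows if f.a not in servers and f.b not in servers}
    while remaining:
        peers = defaultdict(set)
        for f in remaining:
            peers[f.a].add(f.b)
            peers[f.b].add(f.a)
        # Highest number of distinct peers first; the socket tuple breaks ties.
        best = max(peers, key=lambda s: (len(peers[s]), s))
        if len(peers[best]) < min_peers:
            break  # no remaining socket acts as a concentrator
        servers.add(best)
        remaining = {f for f in remaining if best not in (f.a, f.b)}
    return servers


# Hypothetical example data: one web server, one DNS server, one isolated flow.
web = Socket("10.0.0.1", 80, 6)
dns = Socket("10.0.0.2", 53, 17)
flows = [
    Flow(Socket("192.0.2.1", 50001, 6), web),
    Flow(Socket("192.0.2.2", 50002, 6), web),
    Flow(Socket("192.0.2.3", 50003, 6), web),
    Flow(Socket("192.0.2.1", 40001, 17), dns),
    Flow(Socket("192.0.2.4", 40002, 17), dns),
    Flow(Socket("198.51.100.1", 40000, 6), Socket("198.51.100.2", 40001, 6)),
]
servers = detect_server_sockets(flows)
```

In this sketch, the web and DNS sockets are selected because each communicates with at least `min_peers` distinct sockets, while the two end points of the isolated flow are not. Passing previously detected sockets via `known` mirrors the idea of remembering server sockets so that their flows can be skipped; recomputing the peer counts in every round is chosen for brevity and would not scale to the data rates mentioned above.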

The rest of this paper is structured as follows: Section 2 describes our data sets and provides a general overview of our approach. Section 3 explains the individual steps of our data processing. Finally, we review related work in Section 4 and conclude in Section 5.