GT: picking up the truth from the ground for Internet traffic *

F. Gringoli, L. Salgarelli, M. Dusi
Università di Brescia

N. Cascarano, F. Risso
Politecnico di Torino

K. C. Claffy
CAIDA

ABSTRACT

Much of Internet traffic modeling, firewall, and intrusion detection research requires traces where some ground truth regarding application and protocol is associated with each packet or flow. This paper presents the design, development and experimental evaluation of gt, an open source software toolset for associating ground truth information with Internet traffic traces. By probing the monitored host’s kernel to obtain information on active Internet sessions, gt gathers ground truth at the application level. Preliminary experimental results show that gt’s effectiveness comes at little cost in terms of overhead on the hosting machines. Furthermore, when coupled with other packet inspection mechanisms, gt can derive ground truth not only in terms of applications (e.g., e-mail), but also in terms of protocols (e.g., SMTP vs. POP3).

Categories and Subject Descriptors
C.2.3 [Computer-Communication Networks]: Network Operations

General Terms
Experimentation, Measurement

Keywords
Ground truth, application layer, transport layer

1. INTRODUCTION

The majority of research activities carried out under the umbrella of Internet traffic analysis require the association of application and protocol ground truth information with traffic traces. Most mechanisms used today to link ground truth meta-data to Internet traffic traces roughly conform to one of the following two procedures. One approach is to create a trace manually by instantiating a realistic pool of applications on many machines. However, such captured traffic typically lacks characteristics that human behavior can induce. The second approach is to record traffic on a live network and apply deep packet inspection (DPI, i.e., pattern-matching filters) to each packet’s payload, usually complemented by port analysis.
But DPI is ineffective when traffic is encrypted, and ambiguous when different protocols exhibit similar signatures, and port-based analysis is rapidly becoming useless.

* This work was supported in part by a grant from Cisco Systems, Inc.

This paper introduces “gt”, a new mechanism to provide ground truth at the application level. The gt architecture is based on a client tool that, by monitoring a host’s kernel, associates each packet flow with the name of its controlling application, and transmits the collected information to a back-end. The post-processing toolset “ipclass” analyzes the traffic captured at the network border by an independent probe and associates each flow with its application label, laying a reliable foundation for the establishment of ground truth for that flow. The tool works on many widespread operating systems, and it is freely available under an Open Source (BSD) license [1].

We evaluate the effectiveness of the toolset in two network environments, with the help of colleagues who consented to be monitored with gt. Our experiments show that the gt architecture can tag up to 99% of the bytes and 95% of the flows on all platforms, while consuming about 5% of the resources on 2 GHz CPUs.

In order to derive ground truth both at the application level (e.g., attaching the “Firefox” or “Thunderbird” label to a given flow) and at the protocol level (e.g., attaching the “SMTP” or “HTTP” label), we include in gt a DPI-based mechanism. We show that the combination of the application label with payload inspection can significantly improve the accuracy of ground truth meta-data¹ compared to current approaches that rely solely on DPI.

The rest of the paper is organized as follows. Section 2 covers the related work. Section 3 presents the gt architecture, discussing the main technical aspects of its implementation. Section 4 describes our experimental testbed and Section 5 the results of our tests on gt.
In light of such results, we further discuss two of the main design choices in Section 6, while Section 7 concludes the paper.

¹ Although ground truth should be, by definition, accurate, not all meta-data that specifies ground truth for Internet traffic traces is correct, since it is often derived with inaccurate means, such as port analysis or DPI.

2. RELATED WORK

Payload inspection, if traces contain at least a portion of the payload, is one of the most popular techniques to establish a form of protocol ground truth [2, 3]. Port-based mechanisms are also used, especially by those working with publicly available traces whose payload has been entirely stripped [4, 5]. However, both port- and payload-based techniques can only provide an estimate of the protocol being carried, in contrast with gt, which unequivocally tags the flow with the application that generated it.
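
The contrast between a port-based estimate and a host-reported application label can be illustrated with a minimal sketch. It is not the code of gt or ipclass: it assumes that a host-side client reports records mapping a flow 5-tuple to an application name, and that flows captured at the border are keyed by the same 5-tuple. All names (`FlowKey`, `PORT_GUESS`, `guess_by_port`, `label_flows`, `ExampleP2P`) and all sample values are hypothetical.

```python
from typing import NamedTuple, Optional


class FlowKey(NamedTuple):
    """Hypothetical 5-tuple identifying a flow."""
    proto: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


# Port-based estimate: a tiny, illustrative well-known-port table.
PORT_GUESS = {25: "SMTP", 80: "HTTP", 110: "POP3"}


def guess_by_port(flow: FlowKey) -> Optional[str]:
    """Estimate the protocol from the destination port only."""
    return PORT_GUESS.get(flow.dst_port)


def label_flows(captured, host_records):
    """Attach the host-reported application name to each captured flow.

    captured: iterable of FlowKey values seen by the border probe.
    host_records: dict mapping FlowKey -> application name; an assumed
    stand-in for what a host-side client could report.
    Flows without a matching host record are labeled None.
    """
    return {flow: host_records.get(flow) for flow in captured}


# Hypothetical flows captured at the network border.
captured = [
    FlowKey("tcp", "10.0.0.5", 50001, "192.0.2.10", 80),
    FlowKey("tcp", "10.0.0.5", 50002, "192.0.2.20", 80),
    FlowKey("tcp", "10.0.0.5", 50003, "192.0.2.30", 25),
]

# Hypothetical host-side reports for the same flows.
host_records = {
    captured[0]: "Firefox",
    captured[1]: "ExampleP2P",   # made-up non-web application on port 80
    captured[2]: "Thunderbird",
}

labels = label_flows(captured, host_records)
for flow in captured:
    print(flow.dst_port, guess_by_port(flow), labels[flow])
```

In this toy run the port table assigns the same guess to both port-80 flows, while the host-reported labels tell them apart. A real deployment would presumably also need timestamps to disambiguate 5-tuples that are reused over time; that detail is left out of the sketch.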