Traffic Causality Graphs: Profiling Network Applications through Temporal and Spatial Causality of Flows

Hirochika Asai
The University of Tokyo
panda@hongo.wide.ad.jp

Kensuke Fukuda
NII / PRESTO JST
kensuke@nii.ac.jp

Hiroshi Esaki
The University of Tokyo
hiroshi@wide.ad.jp

Abstract—Traffic causality graphs (TCGs) are proposed for visualizing and analyzing the temporal and spatial causality of flows to profile network applications without inspecting packet payload. A key idea of TCGs is to focus on the causality of individual flows composed of different application protocols rather than a set of host flows. This idea enables us to analyze temporal interactions between flows, such as the temporal manner of flow generation by identical application programs and interactions between incoming and outgoing flows. We demonstrate the effectiveness of TCGs for profiling network applications in case studies with ground truth datasets. The results show that the simple features of TCGs are discriminative for profiling network applications and that TCGs are also advantageous for profiling application programs, such as user agents of Web browsers and proxies that cannot be classified by existing approaches; this enables us to identify a specific application program that uses the same protocol as other programs. In addition to the TCG features, the visualization of TCGs reveals the causality of each flow, which consequently helps network operators to identify the root causes of other flows, such as malicious ones.

I. INTRODUCTION

Internet usage has become diversified and various network applications are run on the Internet. In this environment, traffic classification is one of the key technologies for IP network management tasks, such as analysis of security incidents, network topology design, and traffic engineering. The simplest traffic classification method is based on the source and destination port numbers of the transport layer (e.g., TCP and UDP) [1].
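As a minimal sketch of the port-based method just described, the lookup below maps a flow's transport protocol and port numbers to an application label. The table `WELL_KNOWN_PORTS` and the function `classify_by_port` are hypothetical names introduced here for illustration; they are not taken from [1], and a real classifier would rely on a full port registry rather than this short list.

```python
# Hypothetical, deliberately tiny port table (illustration only).
# Keys are (transport protocol, port number); values are application labels.
WELL_KNOWN_PORTS = {
    ("tcp", 22): "ssh",
    ("tcp", 25): "smtp",
    ("tcp", 53): "dns",
    ("udp", 53): "dns",
    ("tcp", 80): "http",
    ("tcp", 443): "https",
}


def classify_by_port(proto, src_port, dst_port):
    """Return an application label for a flow, or 'unknown' if no port matches."""
    proto = proto.lower()
    # Try the destination port first (the server side of a client-initiated
    # flow), then fall back to the source port (e.g., a reply from a server).
    for port in (dst_port, src_port):
        label = WELL_KNOWN_PORTS.get((proto, port))
        if label is not None:
            return label
    return "unknown"


if __name__ == "__main__":
    print(classify_by_port("tcp", 51234, 80))
    print(classify_by_port("udp", 53, 40000))
    print(classify_by_port("tcp", 50000, 50001))
```

In this sketch, any flow whose ports are absent from the table is labeled "unknown", regardless of which application actually generated it.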
However, a problem for the port-based method is that port numbers are not statically bound to each application. For example, network applications can use non-standard ports, especially when there are firewall port restrictions. Moreover, some network applications such as peer-to-peer applications may use a random port. Cases such as these make it difficult to classify traffic according to port numbers.

Many advanced techniques that do not rely only on port numbers have been proposed for profiling network application traffic. Signature-based traffic classifiers [2], [3], [4], [5], [6], [7] identify applications from network traffic by inspecting packet payloads (i.e., application data). However, packet inspection creates some privacy concerns, and it is difficult to conduct when the data is encrypted. To solve these privacy and encryption problems, statistical approaches [8], [9], [10] have been proposed to classify applications from network traffic. These approaches use statistical properties, such as the probability distribution of packet inter-arrival time and of packet size, instead of packet payload inspection. These properties are useful for detecting anomalies in network flows, and consequently, they have also been used in anomaly detection methods [11]. An intrinsic approach [12] not relying on signatures or statistical properties checks IP addresses in flows and Web contents found in search engine results corresponding to an IP address to profile end-hosts. However, as the authors mentioned, it cannot profile end-hosts using P2P applications, and applying it to application profiling is difficult because end-hosts, especially end-user hosts, use multiple applications. Other approaches [13], [14], [15], [16], [17] use information on spatial interactions between hosts or flows for traffic classification.
However, these approaches do not focus on the causality of flows and cannot easily profile application programs such as Web browsers/proxies without payload inspection, though they might succeed in profiling certain application classes, such as the Web browsing and P2P file-sharing classes. Moreover, since these approaches neglect causality, the root causes of flows cannot be identified.

In summary, the main problem of existing approaches is that they cannot profile application programs well, although application program profiling is important in network operation [18]. The existing approaches do not focus on the temporal order of flows, even though applications generate flows in a certain temporal manner that varies by application type; for example, Web browsers first resolve a domain name via DNS and then retrieve content over HTTP. In addition to the temporal order of flows, these approaches also ignore interactions between incoming and outgoing flows. For example, a Web proxy partly behaves like a Web client; after receiving an HTTP request, it resolves a domain name and retrieves content from the original Web server. Therefore, the temporal and spatial causality of flows is highly significant for profiling network applications. One practical use of this application program profiling is to identify a specific application program that uses the same protocol as other programs but has security problems.

In this work, we focus on the temporal and spatial causality of individual flows for profiling network applications, without looking at packet payload. Our final goal is to automatically profile application classes and to automatically profile appli-