Traffic Causality Graphs: Profiling Network Applications through Temporal and Spatial Causality of Flows

Hirochika Asai, The University of Tokyo, panda@hongo.wide.ad.jp
Kensuke Fukuda, NII / PRESTO JST, kensuke@nii.ac.jp
Hiroshi Esaki, The University of Tokyo, hiroshi@wide.ad.jp

Abstract—Traffic causality graphs (TCGs) are proposed for visualizing and analyzing the temporal and spatial causality of flows to profile network applications without inspecting packet payload. A key idea of TCGs is to focus on the causality of individual flows composed of different application protocols rather than a set of host flows. This idea enables us to analyze temporal interactions between flows, such as the temporal manner of flow generation by identical application programs and interactions between incoming and outgoing flows. We demonstrate the effectiveness of TCGs for profiling network applications in case studies with ground truth datasets. The results show that the simple features of TCGs are discriminative for profiling network applications and that TCGs are also advantageous for profiling application programs, such as user agents of Web browsers and proxies that cannot be classified by existing approaches; this enables us to identify a specific application program that uses the same protocol as other programs. In addition to the TCG features, the visualization of TCGs reveals the causality of each flow, which consequently helps network operators to identify the root causes of other flows, such as malicious ones.

I. INTRODUCTION

Internet usage has become diversified and various network applications are run on the Internet. In this environment, traffic classification is one of the key technologies for IP network management tasks, such as analysis of security incidents, network topology design, and traffic engineering. The simplest traffic classification method is based on the source and destination port numbers of the transport layer (e.g., TCP and UDP) [1].
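As a minimal sketch of the port-based method just described, the following Python snippet maps a flow's transport protocol and port numbers to an application label. The table `WELL_KNOWN_PORTS` and the function `classify_by_port` are hypothetical names introduced here for illustration only; they are not taken from [1] or from this paper, and the table lists just a handful of well-known port assignments.

```python
# Hypothetical lookup table: (transport protocol, port) -> application label.
# Only a few well-known assignments are listed; this is an illustration,
# not a complete registry.
WELL_KNOWN_PORTS = {
    ("tcp", 25): "smtp",
    ("tcp", 53): "dns",
    ("udp", 53): "dns",
    ("tcp", 80): "http",
    ("tcp", 443): "https",
}


def classify_by_port(proto, src_port, dst_port):
    """Label a flow from its transport protocol and port numbers.

    The destination port is checked first, then the source port, so that
    both directions of a client/server exchange receive the same label.
    Returns "unknown" when neither port is in the table.
    """
    proto = proto.lower()
    for port in (dst_port, src_port):
        label = WELL_KNOWN_PORTS.get((proto, port))
        if label is not None:
            return label
    return "unknown"
```

Under these assumptions, `classify_by_port("tcp", 51234, 80)` yields `"http"`, while the same kind of exchange on a port that is absent from the table, such as 8080, yields `"unknown"`.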
However, a problem with the port-based method is that port numbers are not statically bound to each application. For example, network applications can use non-standard ports, especially when there are firewall port restrictions. Moreover, some network applications, such as peer-to-peer (P2P) applications, may use a random port. Cases such as these make it difficult to classify traffic according to port numbers.

Many advanced techniques that do not rely only on port numbers have been proposed for profiling network application traffic. Signature-based traffic classifiers [2], [3], [4], [5], [6], [7] identify applications from network traffic by inspecting packet payloads (i.e., application data). However, packet inspection raises privacy concerns and is difficult to conduct when the data is encrypted. To solve these privacy and encryption problems, statistical approaches [8], [9], [10] have been proposed to classify applications from network traffic. These approaches use statistical properties, such as the probability distributions of packet inter-arrival time and packet size, instead of packet payload inspection. These properties are useful for detecting anomalies in network flows, and consequently, they have also been used in anomaly detection methods [11]. An intrinsic approach [12] that relies on neither signatures nor statistical properties profiles end-hosts by checking the IP addresses in flows and the Web content that search engines return for those addresses. However, as its authors note, it cannot profile end-hosts using P2P applications, and applying it to application profiling is difficult because end-hosts, especially end-user hosts, use multiple applications.

Other approaches [13], [14], [15], [16], [17] use information on spatial interactions between hosts or flows for traffic classification.
However, these approaches do not focus on the causality of flows and cannot easily profile application programs such as Web browsers and proxies without payload inspection, though they might succeed in profiling certain application classes, such as the Web browsing and P2P file-sharing classes. Moreover, since these approaches neglect causality, the root causes of flows cannot be identified.

In summary, the main problem of existing approaches is that they cannot profile application programs well, although application program profiling is important in network operation [18]. The existing approaches do not focus on the temporal order of flows, even though applications generate flows in a certain temporal manner that varies by application type; for example, Web browsers first resolve a domain name by DNS and then retrieve content over HTTP. In addition to the temporal order of flows, these approaches also ignore interactions between incoming and outgoing flows. For example, a Web proxy partly behaves like a Web client; after receiving an HTTP request, it resolves a domain name and retrieves content from the original Web server. Therefore, the temporal and spatial causality of flows is highly significant for profiling network applications. One practical use of this application program profiling is to identify a specific application program that uses the same protocol as other programs but has security problems.

In this work, we focus on the temporal and spatial causality of individual flows for profiling network applications, without looking at packet payload. Our final goal is to automatically profile application classes and to automatically profile appli-