MALAWI: Aggregated longitudinal analysis of the MAWI dataset

João Taveira Araújo
University College London
j.araujo@ee.ucl.ac.uk

Kensuke Fukuda
National Institute of Informatics
kensuke@nii.ac.jp

1. ABSTRACT

The importance of measurement and analysis of Internet traffic is constantly reasserted as the Internet expands and shifts in often unpredictable ways. The MAWI dataset [1], which provides daily traces across a trans-Pacific link over the past decade, has often been used to analyze traffic from a network perspective. In this paper we focus on information contained at the transport layer and present MALAWI (Measurement and Aggregated Longitudinal Analysis on the WIDE Internet), a new dataset derived from MAWI which extracts information from traced TCP flows and aggregates these statistics by geographical location, AS and network prefix. We briefly illustrate the usefulness of this new dataset by analyzing a month of data to observe the impact of the Tohoku earthquake on delay and loss.

2. INTRODUCTION

Measurement and analysis of Internet traffic is critical not only for a deeper understanding of the evolving nature of the Internet as a whole but also as an input to designing new elements which are able to act efficiently within the current architecture.

The MAWI dataset contains daily 15-minute traffic traces with transport headers spanning the past decade. While the dataset has been available to the wider community for some time, the short timespan of each trace has lent it to further study in areas where the absence of complete flow traces is less significant, such as Internet anomalies [3], or where characterization of traffic is packet-based [2], relying only on the inspection of the IP header and port numbers.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM CoNEXT Student Workshop, December 6, 2011, Tokyo, Japan. Copyright 2011 ACM 978-1-4503-1042-0/11/0012 ...$10.00.

The network layer alone, however, does not contain much of the information which defines a user's experience. To understand many of the inherent characteristics perceived by an application it is necessary to analyze TCP, extracting values for relevant metrics such as the round trip time (RTT) and loss. This flow analysis requires the partial reconstruction of TCP flows to obtain robust measurements and has been attempted before [4], but was limited in scope to complete, bidirectional flows. Additionally, aggregating these statistics in a meaningful manner poses significant challenges due to both the scale of data generated and the availability of external sources to provide context to the original MAWI traces.

MALAWI (Measurement and Aggregated Longitudinal Analysis on the WIDE Internet) builds on the MAWI dataset and will make available flow-level statistics aggregated by source and destination prefixes, autonomous system (AS) and geographic location. Both prefix and AS information for each IP is extracted from information contained within the daily BGP routing updates, which have also been collected from within WIDE since 2004. Currently, geolocation information is obtained from freely available sources as far back as 2008. Prior to this date, country-level information is used based on information provided by regional Network Information Centers (NIC).
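To make the aggregation step concrete, the following is a minimal sketch of how per-flow RTT samples might be grouped by covering prefix and origin AS. It is not the MALAWI implementation. The routing entries, AS numbers, flow records and the names `build_table`, `lookup` and `aggregate_rtt` are all hypothetical. The choice of the median as the summary statistic is likewise an assumption made for illustration, as is the linear longest-prefix search, which a real pipeline would likely replace with a trie.

```python
import ipaddress
from collections import defaultdict
from statistics import median

# Hypothetical routing entries (prefix, origin AS), as might be derived from
# a BGP snapshot. Prefixes and AS numbers are documentation-only values.
ROUTES = [
    ("192.0.2.0/24", 64500),
    ("198.51.100.0/24", 64501),
    ("198.51.100.128/25", 64502),
]

# Hypothetical per-flow samples: (source IP, RTT in milliseconds).
FLOWS = [
    ("192.0.2.10", 120.0),
    ("192.0.2.77", 140.0),
    ("198.51.100.5", 30.0),
    ("198.51.100.200", 80.0),
    ("203.0.113.9", 50.0),  # no covering prefix in ROUTES
]


def build_table(routes):
    """Parse routes and order them from most to least specific prefix."""
    table = [(ipaddress.ip_network(prefix), asn) for prefix, asn in routes]
    table.sort(key=lambda entry: entry[0].prefixlen, reverse=True)
    return table


def lookup(table, ip):
    """Longest-prefix match by linear scan; returns (network, asn) or (None, None)."""
    addr = ipaddress.ip_address(ip)
    for net, asn in table:
        if addr in net:
            return net, asn
    return None, None


def aggregate_rtt(table, flows):
    """Group RTT samples by matched prefix and by origin AS, then take medians."""
    by_prefix = defaultdict(list)
    by_as = defaultdict(list)
    for src_ip, rtt_ms in flows:
        net, asn = lookup(table, src_ip)
        if net is None:
            continue  # skip flows whose source has no covering route
        by_prefix[str(net)].append(rtt_ms)
        by_as[asn].append(rtt_ms)
    prefix_median = {key: median(vals) for key, vals in by_prefix.items()}
    as_median = {key: median(vals) for key, vals in by_as.items()}
    return prefix_median, as_median


prefix_rtt, as_rtt = aggregate_rtt(build_table(ROUTES), FLOWS)
```

Because the statistics are keyed by prefix and AS rather than by address, a summary of this kind does not depend on the individual IP addresses of the underlying flows.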
With the resulting dataset we hope to provide researchers with greater insight into essential metrics without the limitations imposed by IP address anonymization. While the dataset is intended for longitudinal studies of the evolution of TCP behaviour, we illustrate the potential of MALAWI by analyzing the traces for March 2011 and viewing the effect of the Tohoku earthquake on both RTT and loss.

3. TOHOKU EARTHQUAKE

While the devastating effects of the Tohoku earthquake on Friday, 11 March 2011 are well known, the impact on network operations within Japan is less than clear. While a significant proportion of both users and infrastructure were in largely unaffected regions, it