Web Content Cartography

Bernhard Ager, T-Labs/TU Berlin, bernhard@net.t-labs.tu-berlin.de
Wolfgang Mühlbauer, ETH Zurich, muehlbauer@tik.ee.ethz.ch
Georgios Smaragdakis, T-Labs/TU Berlin, georgios@net.t-labs.tu-berlin.de
Steve Uhlig, T-Labs/TU Berlin, steve@net.t-labs.tu-berlin.de

ABSTRACT

Recent studies show that a significant part of Internet traffic is delivered through Web-based applications. To cope with the increasing demand for Web content, large-scale content hosting and delivery infrastructures, such as data-centers and content distribution networks, are continuously being deployed. Being able to identify and classify such hosting infrastructures is helpful not only to content producers, content providers, and ISPs, but also to the research community at large, for example to quantify the degree of hosting infrastructure deployment in the Internet or the replication of Web content.

In this paper, we introduce Web Content Cartography, i.e., the identification and classification of content hosting and delivery infrastructures. We propose a lightweight and fully automated approach to discover hosting infrastructures based only on DNS measurements and BGP routing table snapshots. Our experimental results show that our approach is feasible even with a limited number of well-distributed vantage points. We find that some popular content is served exclusively from specific regions and ASes. Furthermore, our classification enables us to derive content-centric AS rankings that complement existing AS rankings and shed light on recent observations about shifts in inter-domain traffic and the AS topology.
Categories and Subject Descriptors
C.2.5 [Computer-Communication Networks]: Local and Wide-Area Networks—Internet

General Terms
Measurement

Keywords
Content delivery, hosting infrastructures, measurement, DNS

The measurement traces are available from http://www.inet.tu-berlin.de/?id=cartography

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IMC’11, November 2–4, 2011, Berlin, Germany.
Copyright 2011 ACM 978-1-4503-1013-0/11/11 ...$10.00.

1. INTRODUCTION

Today’s demand for Web content in the Internet is enormous, reflecting the value Internet users give to content [18]. Recent traffic studies [15, 12, 22, 27] show that Web-based applications are again very popular. To cope with this demand, Web-based applications and Web content producers use scalable and cost-effective hosting and content delivery infrastructures. These infrastructures, which we refer to as hosting infrastructures throughout this paper, have multiple options for how and where to place their servers.

Leighton differentiates between three options for Web content delivery [24]: (i) centralized hosting, (ii) data-center-based content distribution networks (CDNs), and (iii) cache-based CDNs. Approaches (ii) and (iii) allow content delivery to scale by distributing the content onto a dedicated hosting infrastructure. This hosting infrastructure can be composed of a few large data-centers, a large number of caches, or any combination of the two. In many cases, DNS is used by the hosting infrastructure to select the server from which a user will obtain content [20, 37, 7, 30].
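Because DNS decides which server a client is directed to, resolving the same hostname from several vantage points exposes part of a hosting infrastructure's footprint. The following is a minimal sketch of that idea, not the method developed in this paper: it combines DNS answers with a BGP table snapshot through longest-prefix matching and summarizes the distinct IP addresses, /24 subnetworks, BGP prefixes, and origin ASes observed for one hostname. The function names `origin_as` and `footprint`, the vantage point labels, and all addresses and AS numbers are hypothetical; the addresses come from documentation ranges and the AS numbers from the range reserved for documentation.

```python
import ipaddress

# Hypothetical BGP snapshot: announced prefix -> origin AS number.
BGP_TABLE = {
    "192.0.2.0/24": 64500,
    "203.0.113.0/24": 64501,
    "203.0.113.128/25": 64502,  # more specific prefix inside the /24 above
}

# Hypothetical DNS answers for one hostname, keyed by vantage point.
DNS_ANSWERS = {
    "vantage-eu": ["192.0.2.10", "192.0.2.11"],
    "vantage-us": ["203.0.113.200"],
    "vantage-asia": ["203.0.113.5", "192.0.2.10"],
}


def origin_as(ip, table):
    """Return (prefix, asn) of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix, asn in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    return best


def footprint(answers, table):
    """Summarize the network locations seen for one hostname."""
    ips, subnets, prefixes, ases = set(), set(), set(), set()
    for addresses in answers.values():
        for ip in addresses:
            ips.add(ip)
            # The /24 that contains the address, independent of BGP.
            subnets.add(str(ipaddress.ip_network(ip + "/24", strict=False)))
            match = origin_as(ip, table)
            if match is not None:
                prefixes.add(str(match[0]))
                ases.add(match[1])
    return {"ips": ips, "subnets_24": subnets, "prefixes": prefixes, "ases": ases}


if __name__ == "__main__":
    for key, values in footprint(DNS_ANSWERS, BGP_TABLE).items():
        print(key, sorted(values))
```

The linear scan over the table keeps the sketch short; a real routing table with several hundred thousand prefixes would call for a radix tree or a similar structure for the prefix lookup.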
The deployment of hosting infrastructures is dynamic and flexible in multiple ways, e.g., by increasing the size of the existing hosting infrastructure, changing peerings with ISPs, or placing parts of the infrastructure inside ISP networks. Therefore, being able to identify and classify hosting infrastructures in an automated manner is a step towards understanding this complex ecosystem, and an enabler for many applications. Content producers can benefit from understanding the footprint of hosting infrastructures to place content close to their customer base. For CDNs, a map of hosting infrastructures can assist them in improving their competitiveness in the content delivery market. For ISPs, it is important to know which hosting infrastructures deliver specific content, and at which locations, in order to make relevant peering decisions. The research community needs a better understanding of the evolving ecosystem of hosting infrastructures, given its importance as a driver in the evolution of the Internet.

As demand drives hosting infrastructures to make given content available at multiple locations, identifying a particular hosting infrastructure requires sampling its location diversity. Previous work has attempted to discover specific hosting infrastructures extensively, e.g., Akamai [36, 35, 17]. Such studies rely on the knowledge of a signature that identifies the target infrastructure, e.g., CNAMEs in DNS replies or AS numbers. Labovitz et al. [22] inferred that a small number of hosting infrastructures are responsible for a significant fraction of inter-domain traffic. Unfortunately, this study observes only the traffic crossing AS boundaries, not traffic delivered directly from inside the monitored ISPs. As a consequence, important CDNs such as Akamai as well as data-centers deployed inside ISP networks are under-represented.

In this paper, we introduce Web Content Cartography, i.e., the