Quantile Sampling for Practical Delay Monitoring in Internet Backbone Networks

Baek-Young Choi a,*, Sue Moon b, Rene Cruz c, Zhi-Li Zhang d, Christophe Diot e

a University of Missouri, Kansas City, MO, USA
b Korea Advanced Institute of Science and Technology, Daejeon, Korea
c University of California, San Diego, CA, USA
d University of Minnesota, Twin Cities, MN, USA
e Thomson Research, Paris, France

* Corresponding author. Tel.: +1 816 235-2750; fax: +1 816 235 5159. Email address: choiby@umkc.edu (Baek-Young Choi).

Preprint submitted to Elsevier Science, 1 December 2006

Abstract

Point-to-point delay is an important network performance measure, as it captures service degradations caused by various events. We study how to measure and report delay in a concise and meaningful way for an ISP, and how to monitor it efficiently. We analyze various measurement intervals and potential metric definitions. We find that reporting high quantiles (between 0.95 and 0.99) every 10-30 minutes is the most effective way to summarize the delay in an ISP. We then propose an active probing scheme to estimate a high quantile with bounded error. We show that a small number of probes is sufficient to provide an accurate estimate. We validate the proposed delay monitoring technique on real data collected on the Sprint IP backbone network. Finally, for completeness, we compare the overhead of our active probing technique with that of a passive sampling scheme and show that, for delay measurement, active probing is more practical.

Key words: Delay, Performance monitoring, Active probing

1. Introduction

Point-to-point delay is a powerful “network health” indicator in a backbone network. It captures service degradation due to congestion, link failure, and routing anomalies. Obtaining meaningful and accurate delay information is necessary for both ISPs and their customers. Thus, delay has been used as a key parameter in Service Level Agreements (SLAs) between an ISP and its customers [12, 33]. In this paper, we systematically study how to measure and report delay in a concise and meaningful way for an ISP, and how to monitor it efficiently.

Operational experience suggests that the delay metric should report the delay experienced by most packets in the network, capture anomalous changes, and not be sensitive to statistical outliers such as packets with options and transient routing loops [3, 11]. The common practice in operational backbone networks is to use ping-like tools. ping measures network round trip times (RTTs) by sending ICMP requests to a target machine over a short period of time. However, ping was designed as a reachability tool, not as a delay measurement tool. Its reported delay includes uncertainties due to path asymmetry and ICMP packet generation times at routers. Furthermore, it is not clear how to