922 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 21, NO. 6, AUGUST 2003

Measuring the Size of the Internet via Importance Sampling

Song Xing and Bernd-Peter Paris

Abstract—Measuring the size of the Internet via Monte Carlo sampling requires probing a large portion of the Internet protocol (IP) address space to obtain an accurate estimate. However, the distribution of information servers on the Internet is highly nonuniform over the IP address space. This allows us to design probing strategies based on importance sampling for measuring the prevalence of an information service on the Internet that are significantly more effective than strategies relying on Monte Carlo sampling. We present a thorough analysis of our strategies together with accurate estimates for the current size of the Internet Protocol Version 4 (IPv4) Internet as measured by the number of publicly accessible web servers and FTP servers.

Index Terms—Importance sampling, Monte Carlo sampling, size of the Internet.

I. INTRODUCTION

AS COMPUTERS and communication networks have become faster and more widespread, the Internet has experienced tremendous growth since its inception. Unlike the telephone network, which was designed in a centralized way by major corporations, the Internet design emphasizes decentralized control. Though it is essential to the Internet's scalability and robustness, the decentralization of control causes problems that may hamper the evolution of the Internet, including unreliable service or nonoptimal routing. A more pernicious problem is that it is difficult to determine how large the Internet really is, i.e., to quantify exactly how many hosts are currently on the Internet. Therefore, it is difficult to estimate reliably the growth of the Internet and to predict, for example, when the available Internet addresses will eventually run out.
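The crude Monte Carlo approach mentioned above can be sketched as follows. Note that `probe` here is a synthetic stand-in for an actual network probe: it simulates a hypothetical "Internet" in which one address in 1000 hosts the service of interest, a density invented purely for illustration.

```python
import random

IPV4_SPACE = 2 ** 32  # number of possible IPv4 addresses

def probe(ip):
    # Synthetic stand-in for a real network probe: in this toy model,
    # one address in 1000 hosts the service of interest.
    return ip % 1000 == 0

def monte_carlo_estimate(n_probes, rng):
    # Draw addresses uniformly at random from the entire address space
    # and tally how many respond to the probe.
    hits = sum(probe(rng.randrange(IPV4_SPACE)) for _ in range(n_probes))
    p_hat = hits / n_probes        # estimated prevalence of the service
    return p_hat * IPV4_SPACE      # estimated number of servers

rng = random.Random(1)
estimate = monte_carlo_estimate(200_000, rng)
```

At realistic prevalences, hits are rare under uniform sampling, so very many probes are needed before the estimate stabilizes; this is precisely the inefficiency that motivates importance sampling.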
Hence, developing efficient means for assessing the size of the Internet is of interest, for example, for network engineering or network capacity planning purposes.

There are relatively few publications on measuring the size of the Internet. The Internet Software Consortium, for example, attempts to discover every host on the Internet by querying the domain name system (DNS) [1]. The problem with this approach is that it is inaccurate: a host name with an assigned IP address does not guarantee that the host actually exists. Conversely, a host does not have to be in the DNS to communicate, so a second "ping" step may be needed to obtain the number of live hosts. This approach is also inefficient, as it requires several days to collect data, and it may not be scalable as the Internet continues to grow. In fact, the survey conducted by the Internet Software Consortium may be well suited to take advantage of the methods described herein.

Netcraft conducts a periodic survey of web server software usage on the Internet and the number of web servers [2]. Their statistics are obtained by collecting and collating the host names providing the HTTP service, systematically polling each one with an HTTP request for the server name, and looking in detail at the network characteristics of the HTTP replies. Obviously, data collection under this approach is time-consuming, and the accuracy of the survey depends on the amount of data collected.

In this work, we emphasize our importance-sampling-based method over actual measurements. Nevertheless, to demonstrate the usefulness of our approach, we report our measurements of an important part of the current Internet.

Manuscript received August 18, 2002; revised March 5, 2003. The authors are with the Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA 22030 USA (e-mail: sxing@gmu.edu; pparis@gmu.edu). Digital Object Identifier 10.1109/JSAC.2003.814510
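A Netcraft-style poll of a single host can be sketched as below. This is not Netcraft's actual implementation, only a minimal illustration of the idea: issue an HTTP HEAD request and extract the Server response header, which typically names the server software.

```python
import socket

def parse_server(reply):
    # Scan the raw HTTP response headers for the Server: line.
    for line in reply.split(b"\r\n"):
        if line.lower().startswith(b"server:"):
            return line.split(b":", 1)[1].strip().decode(errors="replace")
    return None

def server_header(host, port=80, timeout=3.0):
    # Issue a minimal HTTP/1.0 HEAD request and return the Server header,
    # or None if the host does not advertise one.
    request = f"HEAD / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        reply = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            reply += chunk
    return parse_server(reply)
```

Polling every known host name this way is what makes the survey time-consuming; the same probing client, pointed at sampled IP addresses instead of collated host names, is the building block of the sampling strategies discussed next.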
Specifically, we are measuring the number of hosts connected to the public Internet (hosts with a publicly routable IP address) providing a given information service such as WWW or FTP. As will be explained below, our methods are based on sampling the Internet protocol (IP) address space. Hence, our methods have their own shortcomings, including an inability to distinguish between multiple web domains hosted by the same server (virtual hosting). Similarly, we would not be able to tell that a system of servers employing some form of load balancing should probably be counted as only a single server. Because of these differences, it should be expected that our results are quite different from those obtained by Netcraft [2], for example.

The primary strengths of the methods proposed herein are simplicity, wide applicability, and scalability. The sampling-based strategies consist only of an address generator that determines which IP addresses are to be probed, the probing client itself, and a simple analysis system for tallying the results of the probes. Our methods are widely applicable to network applications following the client-server paradigm. For each such application, only the probing client would have to be altered. The results could be used to track the prevalence and growth of a network application or the rate of adoption of a new protocol. Similarly, if probes employ some form of echo request, the size of the entire public Internet may be measured. Perhaps most importantly, we believe that our methods are able to keep up with the continued explosive growth of the Internet. Since importance sampling allows us to focus measurements on the most relevant part of the address space, we anticipate that measurement methods based on importance sampling will scale with the size of the address space.

This paper principally proposes and investigates novel, efficient, and effective methods based on importance sampling for

0733-8716/03$17.00 © 2003 IEEE
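The importance-sampling idea behind the strategies above can be illustrated with a toy model in which servers concentrate in a few /8 blocks. The block densities and the proposal distribution below are invented for illustration; an address generator draws blocks from the nonuniform proposal, the probe is simulated, and each hit is reweighted by target-over-proposal probability so the estimate remains unbiased.

```python
import random

IPV4_SPACE = 2 ** 32
N_BLOCKS = 256                      # /8 prefixes
BLOCK_SIZE = IPV4_SPACE // N_BLOCKS

# Toy nonuniform "Internet": servers concentrate in the first 8 /8 blocks.
DENSITY = {b: (1e-2 if b < 8 else 1e-5) for b in range(N_BLOCKS)}

def probe(block, rng):
    # Stand-in for probing a random address inside the block: a hit
    # occurs with that block's server density.
    return rng.random() < DENSITY[block]

def importance_estimate(n_probes, proposal, rng):
    # proposal[b] = probability of drawing block b; it must be positive
    # wherever servers may exist, or the estimator becomes biased.
    blocks = list(range(N_BLOCKS))
    weights = [proposal[b] for b in blocks]
    acc = 0.0
    for _ in range(n_probes):
        b = rng.choices(blocks, weights=weights)[0]
        if probe(b, rng):
            # Importance weight: uniform target (1/N_BLOCKS) over proposal.
            acc += (1.0 / N_BLOCKS) / proposal[b]
    p_hat = acc / n_probes          # unbiased prevalence estimate
    return p_hat * IPV4_SPACE       # estimated number of servers

# Proposal that concentrates probes on the dense blocks.
proposal = {b: (0.08 if b < 8 else 0.36 / 248) for b in range(N_BLOCKS)}
rng = random.Random(0)
estimate = importance_estimate(50_000, proposal, rng)
```

With the same number of probes, concentrating samples where servers are dense yields far more hits, and hence a much lower-variance estimate, than uniform Monte Carlo sampling; this variance reduction is the advantage the paper develops.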