922 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 21, NO. 6, AUGUST 2003
Measuring the Size of the Internet
via Importance Sampling
Song Xing and Bernd-Peter Paris
Abstract—Measuring the size of the Internet via Monte Carlo
sampling requires probing a large portion of the Internet protocol
(IP) address space to obtain an accurate estimate. However, the dis-
tribution of information servers on the Internet is highly nonuni-
form over the IP address space. This allows us to design probing
strategies based on importance sampling for measuring the preva-
lence of an information service on the Internet that are significantly
more effective than strategies relying on Monte Carlo sampling.
We present a thorough analysis of our strategies together with accu-
rate estimates for the current size of the Internet Protocol Version
4 (IPv4) Internet as measured by the number of publicly accessible
web servers and FTP servers.
Index Terms—Importance sampling, Monte Carlo sampling, size
of the Internet.
I. INTRODUCTION
AS COMPUTERS and communication networks have be-
come faster and more widespread, the Internet has ex-
perienced tremendous growth since its inception. Unlike the
telephone network which was designed in a centralized way by
major corporations, the Internet design emphasizes decentral-
ized control. Though it is essential to the Internet’s scalability
and robustness, the decentralization of control causes prob-
lems that may hamper the evolution of the Internet, including
unreliable service or nonoptimal routing.
A more pernicious problem is that it is difficult to determine
how large the Internet really is, i.e., to quantify exactly how
many hosts are currently on the Internet. Therefore, it is difficult
to estimate reliably the growth of the Internet and predict, for
example, when the available Internet addresses will eventually run
out. Hence, developing efficient means for assessing the size of
the Internet is of interest, for example, for network engineering
or network capacity planning purposes.
There are relatively few publications on measuring the size of
the Internet. The Internet Software Consortium, for example, at-
tempts to discover every host on the Internet by querying the do-
main name system (DNS) [1]. The problem with this approach
is that it is inaccurate: the existence of a host name with an assigned IP
address does not imply that the host actually exists. Conversely, a host
does not have to be in the DNS to communicate, thus a second
“ping” step may be needed to obtain the number of live hosts.
This approach is also inefficient as it requires several days to
Manuscript received August 18, 2002; revised March 5, 2003.
The authors are with the Department of Electrical and Computer Engineering,
George Mason University, Fairfax, VA 22030 USA (e-mail: sxing@gmu.edu;
pparis@gmu.edu).
Digital Object Identifier 10.1109/JSAC.2003.814510
collect data, and it may not be scalable as the Internet continues
to grow. In fact, the survey conducted by the Internet Software
Consortium may be well suited to take advantage of the methods
described herein.
Netcraft does a periodic survey of web server software usage
on the Internet and the number of web servers [2]. Their statis-
tics are obtained by collecting and collating the host names pro-
viding the HTTP service, systematically polling each one with
an HTTP request for the server name, and looking in detail at
the network characteristics of the HTTP replies. Obviously, this
approach requires time-consuming data collection, and the accuracy
of their survey depends on the amount of data collected.
In this work, we emphasize our importance-sampling based
method over actual measurements. Nevertheless, to demonstrate
the usefulness of our approach, we report our measurements of
an important part of the current Internet. Specifically, we are
measuring the number of hosts connected to the public Internet
(hosts with a publicly routable IP address) providing a given in-
formation service such as WWW or FTP. As will be explained
below, our methods are based on sampling the Internet protocol
(IP) address space. Hence, our methods have their own short-
comings, including an inability to distinguish between multiple
web domains hosted by the same server (virtual hosting). Similarly,
we would not be able to recognize that a system of servers
employing some form of load balancing should be counted
as only a single server. Because of these differences,
it should be expected that our results are quite different from
those obtained by Netcraft [2] for example.
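As a concrete sketch of what probing an IP address for a given service entails, a minimal TCP-based probing client might look like the following. This is our illustration only (the function name probe_tcp, its defaults, and the port choices are assumptions), not the probing client actually used in the measurements:

```python
import socket

def probe_tcp(addr, port=80, timeout=2.0):
    """Probe one IP address: return True if some process accepts a TCP
    connection on the given port (e.g., a web server on port 80)."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, host unreachable, or timeout: count as no server.
        return False
```

Adapting such a client to another service in the client-server paradigm, say FTP, would amount to probing port 21 instead of port 80, which reflects why only the probing client needs to change per application.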
The primary strengths of the methods proposed herein are
simplicity, wide applicability, and scalability. The sampling
based strategies consist only of an address generator that
determines which IP addresses are to be probed, the probing
client itself, and a simple analysis system for tallying the results
of the probes. Our methods are widely applicable to network
applications following the client-server paradigm. For each
such application, only the probing client would have to be
altered. The results could be used to track the prevalence and
growth of a network application or the rate of adoption of a
new protocol. Similarly, if probes employ some form of echo
request, the size of the entire public Internet may be measured.
Perhaps, most importantly, we believe that our methods are able
to keep up with the continued explosive growth of the Internet.
Since importance-sampling allows us to focus measurements
on the most relevant part of the address space, we anticipate
that measurement methods based on importance sampling will
scale with the size of the address space.
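The advantage of focusing probes on the relevant part of the address space can be illustrated with a toy model. The sketch below compares a plain Monte Carlo estimate of the number of servers against an importance-sampling estimate that biases probes toward a densely populated region and reweights by the sampling probability; the address-space size, server counts, and the probe stand-in are all invented for illustration and bear no relation to the paper's actual measurements:

```python
import random

random.seed(1)

N = 1_000_000                  # toy "address space" size
HOT_SIZE = 10_000              # servers live only in this small region
servers = set(random.sample(range(HOT_SIZE), 2_000))  # 2,000 servers

def probe(addr):
    """Stand-in for a network probe: 1 if a server answers at addr."""
    return 1 if addr in servers else 0

def mc_estimate(n_probes):
    """Plain Monte Carlo: sample addresses uniformly over the whole space."""
    hits = sum(probe(random.randrange(N)) for _ in range(n_probes))
    return N * hits / n_probes

def is_estimate(n_probes, p_hot=0.9):
    """Importance sampling: bias probes toward the hot region and
    reweight each probe by 1/q(addr), where q is the sampling pmf."""
    total = 0.0
    for _ in range(n_probes):
        if random.random() < p_hot:
            addr = random.randrange(HOT_SIZE)
            q = p_hot / HOT_SIZE
        else:
            addr = random.randrange(HOT_SIZE, N)
            q = (1 - p_hot) / (N - HOT_SIZE)
        total += probe(addr) / q
    return total / n_probes

print(mc_estimate(10_000))   # noisy estimate of the 2,000 servers
print(is_estimate(10_000))   # far lower variance for the same probe budget
```

Both estimators are unbiased, but with the same budget of 10,000 probes the uniform sampler wastes almost all probes on empty addresses, while the biased sampler concentrates effort where servers are and corrects for the bias through the 1/q weights.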
This paper principally proposes and investigates novel, effi-
cient and effective methods based on importance sampling for
0733-8716/03$17.00 © 2003 IEEE