International Statistical Review (2018), 86, 2, 322–343 doi:10.1111/insr.12253
Comparing Inference Methods for
Non-probability Samples
Bart Buelens¹, Joep Burger¹ and Jan A. van den Brakel¹,²
¹Statistics Netherlands, P.O. Box 4481, 6401 CZ Heerlen, The Netherlands
²Maastricht University School of Business and Economics, P.O. Box 616, 6200 MD Maastricht, The Netherlands
E-mail: b.buelens@cbs.nl
Summary
Social and economic scientists are tempted to use emerging data sources like big data to compile
information about finite populations as an alternative to traditional survey samples. These data
sources generally cover an unknown part of the population of interest. Simply assuming that
analyses of such data apply to the larger population is wrong. The mere volume of data
provides no guarantee of valid inference. Tackling this problem with methods originally developed
for probability sampling is possible but is shown here to be limited. A wider range of model-based
predictive inference methods proposed in the literature is reviewed and evaluated in a simulation
study using real-world data on annual vehicle mileages. We propose to extend this predictive
inference framework with machine learning methods for inference from samples that are generated
through mechanisms other than random sampling from a target population. Describing economies
and societies using sensor data, internet search data, social media and voluntary opt-in panels
is cost-effective and timely compared with traditional surveys, but it requires an extended inference
framework, as proposed in this article.
Key words: Algorithmic inference; big data; predictive modelling; pseudo-design-based estimation.
1 Introduction
Evidence-based policymaking is nourished by accurate statistical information about the social
and economic development of societies. The emergence of large amounts of data as a
by-product of human and economic activity—big data—yields potentially new sources of infor-
mation (Ginsberg et al., 2009; Daas et al., 2015). Examples of big data are internet search
behaviour, social media, the internet of things, retail scanner data, electronic funds transfers,
and Global Positioning System (GPS) and other sensor data. Compared with probability sam-
ples, big data sources typically contain far more observations; they are timelier, cheaper and may
be less susceptible to measurement error; and they cause no administrative burden. They also
pose a major methodological challenge, as they do not necessarily cover the target population
completely (Beręsewicz, 2017). The century-old legacy of sampling theory does not apply to
big data.
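To make this last point concrete, the following minimal sketch (not taken from this article) simulates a hypothetical population in which annual mileage declines with vehicle age, and contrasts a small simple random sample with a much larger but selective big data source that over-represents newer vehicles. The population model, coverage probabilities and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical finite population of N vehicles: annual mileage (km)
# decreases with vehicle age. All values are illustrative assumptions.
N = 1_000_000
age = rng.integers(0, 30, N)                       # vehicle age in years
mileage = np.maximum(0.0, 20_000 - 500 * age + rng.normal(0, 3_000, N))

# Small probability sample: a simple random sample of n = 1 000.
srs = rng.choice(mileage, size=1_000, replace=False)

# Large non-probability source: newer vehicles are far more likely
# to be observed, so the (unknown) coverage is selective.
p_incl = np.where(age < 10, 0.8, 0.2)
big = mileage[rng.random(N) < p_incl]

print(f"population mean: {mileage.mean():.0f}")
print(f"SRS mean (n = 1 000): {srs.mean():.0f}")         # unbiased by design
print(f"big data mean (n = {big.size}): {big.mean():.0f}")  # large n, biased
```

Despite covering hundreds of thousands of vehicles, the selective source overestimates the mean mileage by thousands of kilometres, whereas the small random sample is unbiased; correcting for such selectivity is precisely the problem addressed by the inference methods reviewed in this article.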
Until the beginning of the 20th century, a complete enumeration of the population was
deemed necessary to obtain valid quantitative information about finite populations. In the first
half of the 20th century, probability sampling was developed to draw valid inference from