International Statistical Review (2018), 86, 2, 322–343 doi:10.1111/insr.12253

Comparing Inference Methods for Non-probability Samples

Bart Buelens 1, Joep Burger 1 and Jan A. van den Brakel 1,2
1 Statistics Netherlands, P.O. Box 4481, 6401 CZ Heerlen, The Netherlands
2 Maastricht University School of Business and Economics, P.O. Box 616, 6200 MD Maastricht, The Netherlands
E-mail: b.buelens@cbs.nl

Summary

Social and economic scientists are tempted to use emerging data sources like big data to compile information about finite populations as an alternative to traditional survey samples. These data sources generally cover an unknown part of the population of interest. Simply assuming that analyses made on these data are applicable to larger populations is wrong. The mere volume of data provides no guarantee for valid inference. Tackling this problem with methods originally developed for probability sampling is possible but shown here to be limited. A wider range of model-based predictive inference methods proposed in the literature is reviewed and evaluated in a simulation study using real-world data on annual mileages by vehicles. We propose to extend this predictive inference framework with machine learning methods for inference from samples that are generated through mechanisms other than random sampling from a target population. Describing economies and societies using sensor data, internet search data, social media and voluntary opt-in panels is cost-effective and timely compared with traditional surveys but requires an extended inference framework as proposed in this article.

Key words: Algorithmic inference; big data; predictive modelling; pseudo-design-based estimation.

1 Introduction

Evidence-based policymaking is nourished by accurate statistical information about social and economic developments of societies.
The emergence of large amounts of data as a by-product of human and economic activity—big data—yields potentially new sources of information (Ginsberg et al., 2009; Daas et al., 2015). Examples of big data are internet search behaviour, social media, the internet of things, retail scanner data, electronic funds transfers, Global Positioning System (GPS) and other sensor data. Compared with probability sampling, big data typically contain much more data; they are timelier, cheaper and may be less susceptible to measurement error; and they do not cause administrative burden. They also provide a major methodological challenge, as they do not necessarily completely cover the target population (Beręsewicz, 2017).

The century-old legacy of sampling theory does not apply to big data. Until the beginning of the 20th century, a complete enumeration of the population was deemed necessary to obtain valid quantitative information about finite populations. In the first half of the 20th century, probability sampling was developed to draw valid inference from

International Statistical Review © 2018 The Authors. International Statistical Review © 2018 International Statistical Institute. Published by John Wiley & Sons Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.