Cross-sectional Markov model for forecasting population characteristics

Agnieszka Werpachowska * and Roman Werpachowski
London, United Kingdom

Abstract

We present a stochastic model of population dynamics exploiting cross-sectional data in trend analysis and forecasts for groups and cohorts of a population. While sharing the convenient features of classic Markov models, it alleviates the practical problems experienced in longitudinal studies. Based on statistical and information-theoretical analysis, we adopt maximum likelihood estimation to determine model parameters, facilitating the use of a range of model selection methods. Their application to several synthetic and empirical datasets shows that the proposed approach is robust, stable and superior to a regression-based one. We extend the basic framework to simulate ageing cohorts and processes with finite memory, distinguishing their short- and long-term trends; introduce regularisation to avoid the ecological fallacy; and generalise it to mixtures of cross-sectional and (possibly incomplete) longitudinal data. The presented model illustrations yield new and interesting results, such as an implied common driving factor in obesity for all generations of the English population and “yo-yo” dieting in the U.S. data.

Keywords: cross-sectional data, longitudinal data, pooled data, Markov model, forecasting, BMI, marijuana

1 Introduction

The abundance of statistical surveys and censuses from past years invites new, enhanced methods for studying various aspects of the composition and dynamics of populations. Gathered in different forms, as cross-sectional or longitudinal data, they provide information on large, independent or overlapping, sets of subjects drawn from a population and observed at several points in time. The former presents a snapshot of the population for quantitative and comparative analysis, while the latter tracks selected individuals, facilitating cohort and causal inferences.
Cross-sectional data is often regarded as inferior to longitudinal data, as it does not capture the mechanisms underpinning observed effects. At the same time, however, it is unaffected by such problems as attrition, conditioning or response bias, while its much cheaper and faster collection procedure does not raise concerns about confidentiality and data protection legislation. For these reasons, it is tempting to search for ways of employing it in longitudinal analysis.

* a.m.werpachowska@gmail.com

Making inferences about the population dynamics on the basis of severed longitudinal information gleaned from cross-sectional data requires a suitable theoretical approach and modelling tools. Several proposed methods, e.g. [1–18], are essentially based on regression techniques, concern cohort studies, or address special cases of modelling repeated cross-sectional data using Markov models. In this paper we present a generally applicable cross-sectional Markov (CSM) model for the transition analysis of survey data exploiting information from cross-sectional samples. While sharing the attractive features of classic Markov models, it avoids the practical problems associated with longitudinal data, and, due to its focus on population transfer rates between discrete states, it