Reconstructing past biomes states using machine learning and modern pollen assemblages: A case study from Southern Africa

Magdalena K. Sobol a,*, Louis Scott b, Sarah A. Finkelstein a

a Department of Earth Sciences, University of Toronto, 22 Russell St, Toronto, M5S 3B1, Canada
b Department of Plant Sciences, University of the Free State, PO Box 339, Bloemfontein, 9300, South Africa

Article history: Received 27 September 2018; received in revised form 28 February 2019; accepted 25 March 2019; available online 4 April 2019

Keywords: Pollen datasets; Data analysis; Objective classification; Biomes; Vegetation dynamics; Vegetation reconstructions; Late Pleistocene; Holocene

Abstract

Fossil pollen assemblages can assist in understanding biome responses to global climate change if there is reasonable probability that they represent specific biomes or bioregions. In this paper, we introduce a novel probabilistic presentation of pollen data and biome assignment. We apply a recently developed pollen-based vegetation classification method, which uses supervised machine learning, to modern pollen assemblages from Southern Africa. We present an updated modern pollen dataset from Southern Africa, linking the sites to previously defined vegetation units and, ultimately, we generate a probabilistic classification for fossil assemblages to reconstruct past vegetation. The modern pollen dataset (N = 211 sites) represents a long vegetation gradient, from desert to forest biomes, capturing broad climate gradients ranging from arid to subtropical. We validate two models using the Random Forest algorithm to classify modern vegetation at different spatial resolutions: subcontinental (biomes) and regional (bioregions). When the modern pollen assemblages (N = 164 sites) are used to predict the vegetation types, the classification models are correct in a number of cases.
In our dataset of 164 sites, the classification model correctly classifies pollen assemblages from savanna (91% correct), grassland (87%), and coastal forest (82%) vegetation types, while the best results for classification of regional vegetation are achieved for sub-humid savanna (95%), dry savanna (95%), coastal forest (91%), and wet grassland (90%). We apply the models to a fossil pollen sequence at Wonderkrater in the South African savanna to reconstruct subcontinental and regional changes in past vegetation states over the last 60,000 years. The most probable vegetation state dominating the region since the Late Pleistocene is sub-humid savanna, yet grassland occurred at times associated with high vegetation variability. Within the record, the most frequent and amplified variability in the inferred vegetation states occurred during the transitional phase between the Late Pleistocene and the Holocene. The machine learning approach for reconstructing past vegetation offers a more complex and nuanced view of past vegetation dynamics and has the potential to support quantitative proxy-based techniques for palaeoclimatic reconstructions.

© 2019 Published by Elsevier Ltd.

1. Introduction

Modeling complex biomes using multivariate proxy data is ecologically challenging and computationally expensive. At the time of its development, the biomization method (Prentice et al., 1992, 1996) was a revolutionary approach to modeling biomes from pollen data. The method rests on the assumption that the functional relationship between form and function of a few key plants may be substituted for biomes and biome modeling. Biomes are classified using pollen assemblages through plant functional types (PFTs); to link pollen assemblages to biomes, two binary matrices, one assigning pollen assemblages to PFTs and one assigning PFTs to biomes, are multiplied (Prentice et al., 1992). Thus, the biomization method reduces large complex datasets to a smaller number of representative plants.
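The matrix step described above can be illustrated with a minimal sketch. It is not the published biomization procedure: the taxa, PFTs, biomes, matrix entries, and sample percentages are hypothetical toy values, the function names are invented for illustration, and the scoring step (a plain sum of percentages) is a deliberate simplification of the affinity scores used in practice.

```python
import numpy as np

# Hypothetical toy labels, chosen only for illustration.
taxa = ["Poaceae", "Podocarpus", "Acacia"]
pfts = ["grass", "evergreen_tree", "savanna_tree"]
biomes = ["grassland", "forest", "savanna"]

# Binary matrix 1: rows are pollen taxa, columns are PFTs.
taxon_to_pft = np.array([
    [1, 0, 0],  # Poaceae    -> grass
    [0, 1, 0],  # Podocarpus -> evergreen_tree
    [0, 0, 1],  # Acacia     -> savanna_tree
])

# Binary matrix 2: rows are PFTs, columns are biomes.
pft_to_biome = np.array([
    [1, 0, 1],  # grass occurs in grassland and savanna
    [0, 1, 0],  # evergreen_tree occurs in forest
    [0, 0, 1],  # savanna_tree occurs in savanna
])


def taxon_biome_matrix(taxon_to_pft, pft_to_biome):
    """Multiply the two binary matrices and binarize the product.

    An entry is 1 when a taxon is linked to a biome through at least one PFT.
    """
    return (taxon_to_pft @ pft_to_biome > 0).astype(int)


def score_biomes(percentages, taxon_to_biome):
    """Simplified score: sum the pollen percentages of taxa linked to each biome."""
    return percentages @ taxon_to_biome


taxon_to_biome = taxon_biome_matrix(taxon_to_pft, pft_to_biome)

# Hypothetical pollen percentages for one sample, in the order of `taxa`.
sample = np.array([60.0, 5.0, 35.0])
scores = score_biomes(sample, taxon_to_biome)
best = biomes[int(np.argmax(scores))]

print(dict(zip(biomes, scores.tolist())), "->", best)
```

In this toy case the grass taxon contributes to both grassland and savanna, so the tree taxon decides the outcome. That illustrates how the assignment can hinge on a small number of taxa.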
The method, however, relies on a few key pollen taxa to represent biomes. Methods that consider whole pollen datasets may provide additional nuance, particularly for periods of high climatic variability (Williams et al., 2004). Moreover, PFTs created for one region cannot be easily applied to another region. Thus, new PFTs must be created for new contexts.

* Corresponding author. E-mail address: magdalena.sobol@mail.utoronto.ca (M.K. Sobol).
Quaternary Science Reviews 212 (2019) 1–17
https://doi.org/10.1016/j.quascirev.2019.03.027
0277-3791/© 2019 Published by Elsevier Ltd.

Methodological biases can