Estimation and potential improvement of the quality of legacy soil samples for digital soil mapping

F. Carré a,*, Alex B. McBratney b, B. Minasny b

a European Commission, DG Joint Research Centre, Institute of Environment and Sustainability, Land Management Unit, TP 280, 21020 Ispra (Va), Italy
b Australian Centre for Precision Agriculture, Faculty of Agriculture, Food & Natural Resources, The University of Sydney, NSW 2006, Australia

Received 19 January 2006; received in revised form 24 August 2006; accepted 24 January 2007
Available online 29 May 2007

Abstract

Legacy soil data form an important resource for digital soil mapping and are essential for calibrating models that predict soil properties from environmental variables. Such data arise from traditional soil survey. Methods of soil survey are generally empirical and based on the surveyor's mental model, which correlates soil with underlying geology, landforms, vegetation and air-photo interpretation. There are no statistical criteria for traditional soil sampling, and this may lead to biases in the areas being sampled. The challenge is to test whether legacy data can be used for large-area mapping (e.g. at national or continental extents) so as to limit the cost of field survey. The problem is then to assess the reliability and quality of the legacy soil databases, which have been populated mainly by traditional soil survey, and, if additional funding for sampling is available, to determine where new sampling units should be located. This additional sampling can be used to improve and validate the prediction model. Latin hypercube sampling (LHS) has been proposed as a sampling design for digital soil mapping when there is no prior sample. We use the principle of hypercube sampling to assess the quality of existing soil data and to guide us to locations that need to be sampled.
First, an area is defined and the empirical environmental data layers, or covariates, are identified on a regular grid. The existing soil data are matched with the environmental variables. The HELS algorithm is used to check the occupancy of the legacy sampling units in the hypercube of the quantiles of the environmental covariates. This determines whether the legacy soil survey data occupy the hypercube uniformly or whether some partitions of the hypercube are over- or under-observed. It also allows posterior estimation of the apparent probability of sample units being surveyed. From this information we can design further sampling. The methods are illustrated using legacy soil samples from Edgeroi, New South Wales, Australia, and from a large part of the Danube Basin. One third of the total number of sampling units are added to the original dataset. These new sampling units improve the representation of the feature space of the covariates. The standard deviation of the overall density is consequently smaller.

© 2007 Published by Elsevier B.V.

Keywords: Legacy soil data; Soil sampling; Hypercube sampling; Pedometrics; Soil survey; Digital soil mapping

1. Introduction

Legacy soil data arise from traditional soil survey (Bui and Moran, 2001). Methods of soil survey are generally empirical and based on the surveyor's mental model, which correlates soil with underlying geology, landforms, vegetation and air-photo interpretation. There are no statistical criteria for traditional soil sampling, and this may lead to bias in the areas being sampled.

de Gruijter et al. (2006) offer some very thoughtful definitions in relation to sampling, which we paraphrase here and use subsequently. Sampling sensu lato comprises selecting parts from a universe with the purpose of taking observations on them. The selected parts may be observed in situ, or material may be taken from them for subsequent measurement in a laboratory.
Geoderma 141 (2007) 1–14
www.elsevier.com/locate/geoderma
* Corresponding author. Tel.: +39 0332 78 65 46; fax: +39 0332 78 63 94. E-mail address: Florence.Carre@jrc.it (F. Carré).
0016-7061/$ - see front matter
doi:10.1016/j.geoderma.2007.01.018

It is the collection of selected parts that is referred to as the sample. A single part that is, or could be, selected is referred to as a sampling unit. The total number of sampling