Mapping numerically classified soil taxa in Kilombero Valley, Tanzania using machine learning

Boniface H.J. Massawe a,b, Sakthi K. Subburayalu a, Abel K. Kaaya b, Leigh Winowiecki c, Brian K. Slater a

a School of Environment and Natural Resources, The Ohio State University, 210 Kottman Hall, 2021 Coffey Road, Columbus, OH 43210, USA
b Department of Soil and Geological Sciences, Sokoine University of Agriculture, PO Box 3008, Morogoro, Tanzania
c World Agroforestry Centre, United Nations Avenue, Gigiri, Nairobi, Kenya

Article info
Article history: Received 21 March 2016; Received in revised form 11 November 2016; Accepted 14 November 2016; Available online xxxx

Abstract
Inadequate spatial soil information is one of the factors limiting evidence-based decisions to improve food security and land management in developing countries. Various digital soil mapping (DSM) techniques have been applied in many parts of the world to improve the availability and usability of soil data, but less has been done in Africa, particularly in Tanzania and at the scale necessary to make farm management decisions. The Kilombero Valley has been identified for intensified rice production. However, the valley lacks detailed and up-to-date soil information for decision-making. The overall objective of this study was to develop a predictive soil map of a portion of Kilombero Valley using DSM techniques. Two widely used decision tree algorithms and three sources of Digital Elevation Models (DEMs) were evaluated for their predictive ability. Firstly, a numerical classification was performed on the collected soil profile data to arrive at soil taxa. Secondly, the derived taxa were spatially predicted and mapped following the SCORPAN framework using the Random Forest (RF) and J48 machine learning algorithms. Datasets to train the models were derived from a legacy soil map, a RapidEye satellite image, and three DEMs: 1 arc-second SRTM, 30 m ASTER, and 12 m WorldDEM.
Separate predictive models were built using each DEM source. Mapping showed that RF was less sensitive to the training set sampling intensity. Results also showed that predictions of soil taxa using 1 arc SRTM and 12 m WorldDEM were identical. We suggest the use of the RF algorithm in combination with the freely available SRTM DEM for mapping the soils of the whole Kilombero Valley. This combination can be tested and applied in other areas that have relatively flat terrain like the Kilombero Valley.
© 2016 Elsevier B.V. All rights reserved.

Keywords: Kilombero Valley; Numerical classification; Machine learning; Soil mapping; Decision tree analysis; DEM

1. Introduction

The Kilombero Valley in Tanzania presents great potential for the expansion and intensification of rice production. This valley, covering an area of about 11,600 km² (Kato, 2007), has been identified by the Government of Tanzania for financial and technological investments to expand and intensify rice production (TIC, 2013). Rice is the second most important cereal crop in Tanzania after maize (Bucheyeki et al., 2011), and demand for it has been increasing following a shift in preference by the local population from traditional staples to rice, and increased market demand from neighboring countries. To develop and promote sustainable intensification of rice production, farmers and policy makers need to identify the most suitable areas and their respective management options. However, the updated and detailed soil information needed to support this decision-making process is currently lacking.

Accurate soil information is crucial for informing management recommendations aimed at increasing agricultural productivity and overall food security, especially in developing countries where GDP is heavily dependent on the agricultural sector (Cook et al., 2008; Msanya et al., 2002).
Gathering such information through conventional soil inventory takes a relatively long time and generally requires a large amount of resources (McBratney et al., 2003). Recent developments in remote and proximal sensing, computational methods, and information technology have provided means by which soil information can be collected, shared, communicated, and updated more efficiently (Malone, 2013; McBratney et al., 2003; Scull et al., 2003; Vågen et al., 2013; Vågen et al., 2016; Winowiecki et al., 2016a, 2016b). Predictive soil landscape model frameworks such as the SCORPAN approach (McBratney et al., 2003) could be used to predict continuous soil classes and soil attributes that better represent soil spatial variability. The increased availability of high-resolution digital elevation models (DEMs) that provide predictive variables for digital soil mapping, together with advances in machine learning techniques, adds to the ease of generating spatial soil information and depicting uncertainty (Hansen et al., 2009; Haring et al., 2012; Subburayalu and Slater, 2013; Subburayalu et al., 2014).

Corresponding author at: Department of Soil and Geological Sciences, Sokoine University of Agriculture, PO Box 3008, Morogoro, Tanzania.
E-mail addresses: bonmass@yahoo.com (B.H.J. Massawe), L.A.WINOWIECKI@CGIAR.ORG (L. Winowiecki).
Geoderma xxx (2016) xxxxxx. http://dx.doi.org/10.1016/j.geoderma.2016.11.020
0016-7061/© 2016 Elsevier B.V. All rights reserved.
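
As an illustration of how a decision tree algorithm of the kind evaluated in this study can relate a terrain covariate to a soil taxon, the following minimal sketch computes a C4.5-style gain ratio for candidate thresholds on a single covariate. J48 is the Weka implementation of C4.5, but the text above does not describe the split criterion or any software settings, so this is a simplified illustration under stated assumptions and not the authors' workflow. The elevation values, the taxon labels "A" and "B", and the function names are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def gain_ratio(values, labels, threshold):
    """Simplified C4.5-style gain ratio for the binary split value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    total = len(labels)
    w_left, w_right = len(left) / total, len(right) / total
    info_gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return info_gain / split_info


def best_threshold(values, labels):
    """Evaluate midpoints between consecutive distinct values.

    Returns (threshold, gain ratio) for the best candidate. Assumes at
    least two distinct covariate values are present.
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    scored = ((t, gain_ratio(values, labels, t)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical training points: elevation (m) sampled from a DEM and the
# numerically derived soil taxon observed at each point (illustrative only).
elevation = [245.0, 247.0, 250.0, 262.0, 268.0, 275.0]
taxon = ["A", "A", "A", "B", "B", "B"]

threshold, ratio = best_threshold(elevation, taxon)
print(threshold, ratio)
```

With these hypothetical values the two taxa separate cleanly, so the selected threshold is 256 m and the gain ratio is 1.0. A full tree learner would repeat this search over all covariates and recurse on each side of the split, and an ensemble method such as RF would additionally combine many trees grown on resampled training data.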