Journal of Hazardous Materials 167 (2009) 615–624

Classification and regression trees (CARTs) for modelling the sorption and retention of heavy metals by soil

F.A. Vega a, J.M. Matías b, M.L. Andrade a, M.J. Reigosa a, E.F. Covelo a

a Departamento de Bioloxía Vexetal e Ciencias do Solo, Universidade de Vigo, Spain
b Departamento de Estatística, Universidade de Vigo, Spain

Article history: Received 2 October 2008; received in revised form 2 January 2009; accepted 6 January 2009; available online 16 January 2009

Keywords: CART; Regression trees; Soil; Sorption; Retention; Heavy metal; Soil characteristics

Abstract

The sorption and retention of mixtures of heavy metals by soil is a complex process that depends on both soil properties and competition between metals for sorption sites. In this study, the sorption and retention of mixtures of Cd, Cr, Pb, Cu, Zn and Ni by a representative sample of soils from Galicia (N.W. Spain) was reproduced considerably more precisely by binary decision-tree regression models constructed using the CART algorithm than by linear regression models. Of the six metals competing for sorption sites in these experiments, Pb, Cu and Cr were sorbed and retained to a greater extent than Cd, Ni and Zn. Non-linear tree regression models constructed with CART fitted the data better than linear models, especially for Cd, Ni and Zn; and with both kinds of model the data for Pb, Cu and Cr were fitted better than those for Cd, Ni and Zn (the difference being much more marked for linear models), suggesting that the influence of soil properties on the sorption and retention of the latter three metals was limited by the preferential binding of the former three.

© 2009 Elsevier B.V. All rights reserved.

1.
Introduction

Numerous land treatments and other practices, including the application of fertilizer or sewage sludge, the disposal of wastewater on land, and industrial activity, can lead to soils accumulating heavy metal contents substantially in excess of natural levels, with the consequent risk of uptake by plants, pollution of surface or underground waters, and propagation through the food chain [1]. The risk of leaching or uptake by plants depends on the concentration of pollutant in the soil solution, which in turn depends on the sorption–desorption equilibria that govern the partition of pollutant between soil solution and soil solids, soil colloids especially [2,3]. The toxic potential of heavy metals in soil thus depends on soil composition, particularly the amount and type of clay minerals [4–6], organic matter [7,8] and iron and manganese oxides [9–11].

In keeping with the above, in previous work we found that the sorption and desorption of heavy metals by certain soils in Galicia (N.W. Spain) are determined mainly by organic matter, Fe and Mn oxides, and clay and mica contents [12–14]. However, sorption and desorption isotherms have irregular profiles, presumably due to competition among metals for sorption sites, and the dependence of sorption and desorption on soil properties is only moderately well represented by linear models [15,16].

Corresponding author: E.F. Covelo. Tel.: +34 986 812630; fax: +34 986 812556. E-mail address: emmaf@uvigo.es.

A methodology that is gaining favour in an increasingly broad variety of fields for modelling non-linear processes and structures is the use of decision trees, generalizations of the familiar botanical key. When it is a regression model that is needed rather than a classifier, i.e., when the dependent variable Y is a continuous random variable with conditional distribution
$$Y \mid x = f(x) + \varepsilon, \qquad x = (x_1, \ldots, x_n)$$

for some zero-mean random error $\varepsilon$ and the problem is to estimate the regression function $f(x)$, these methods effectively divide the space $X = \{x\}$, in which the random predictor variables $X_i$ take their values, into a finite number $m$ of disjoint hyper-rectangles $D_k$ that together cover $X$, and approximate $f(x)$ by a piecewise constant function

$$\hat{f}(x) = \sum_{k=1}^{m} \hat{c}_k \, 1_k(x), \tag{1}$$

where $1_k$ is the indicator function of $D_k$ ($1_k(x) = 1$ if $x \in D_k$, $1_k(x) = 0$ if $x \notin D_k$) and $\hat{c}_k$ is an estimate of the mean of $Y$ in $D_k$ (in practice, the sample mean). The problem is to define the $D_k$. The regression tree approach (decision-tree regression) does this in successive steps, creating a tree of nested hyper-rectangles $D_k^{(i)}$ (the nodes of the tree), the lowest members of which (the “leaves”) are the final $D_k$.

0304-3894/$ – see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.jhazmat.2009.01.016

To avoid overfitting the model, this tree may then be “pruned back”, a process analogous to backwards elimination of