Journal of Hazardous Materials 167 (2009) 615–624
journal homepage: www.elsevier.com/locate/jhazmat
Classification and regression trees (CARTs) for modelling the sorption and
retention of heavy metals by soil
F.A. Vega (a), J.M. Matías (b), M.L. Andrade (a), M.J. Reigosa (a), E.F. Covelo (a,∗)
(a) Departamento de Bioloxía Vexetal e Ciencias do Solo, Universidade de Vigo, Spain
(b) Departamento de Estatística, Universidade de Vigo, Spain
article info
Article history:
Received 2 October 2008
Received in revised form 2 January 2009
Accepted 6 January 2009
Available online 16 January 2009
Keywords:
CART
Regression trees
Soil
Sorption
Retention
Heavy metal
Soil characteristics
abstract
The sorption and retention of mixtures of heavy metals by soil is a complex process that depends on both
soil properties and competition between metals for sorption sites. In this study, the sorption and retention
of mixtures of Cd, Cr, Pb, Cu, Zn and Ni by a representative sample of soils from Galicia (N.W. Spain) was
reproduced considerably more precisely by binary decision-tree regression models constructed using the
CART algorithm than by linear regression models.
Of the six metals competing for sorption sites in these experiments, Pb, Cu and Cr were sorbed and
retained to a greater extent than Cd, Ni and Zn. Non-linear tree regression models constructed with CART
fitted the data better than linear models, especially for Cd, Ni and Zn; and with both kinds of model the
data for Pb, Cu and Cr were fitted better than those for Cd, Ni and Zn (the difference being much more
marked for linear models), suggesting that the influence of soil properties on the sorption and retention
of the latter three metals was limited by the preferential binding of the former three.
© 2009 Elsevier B.V. All rights reserved.
1. Introduction
Numerous land treatments and other practices, including the application of fertilizer or sewage
sludge, the disposal of wastewater on land, and industrial activity, can lead to soils accumulating
heavy metal contents substantially in excess of natural levels, with the consequent risk of uptake
by plants, pollution of surface or underground waters, and propagation through the food chain [1].
The risk of leaching or uptake by plants depends on the concentration of pollutant in the soil
solution, which in turn depends on the sorption–desorption equilibria that govern the partition of
pollutant between soil solution and soil solids, soil colloids especially [2,3]. The toxic
potential of heavy metals in soil thus depends on soil composition, particularly the amount and
type of clay minerals [4–6], organic matter [7,8] and iron and manganese oxides [9–11].
∗ Corresponding author. Tel.: +34 986 812630; fax: +34 986 812556.
E-mail address: emmaf@uvigo.es (E.F. Covelo).
In keeping with the above, in previous work we found that the sorption and desorption of heavy
metals by certain soils in Galicia (N.W. Spain) is determined mainly by organic matter, Fe and
Mn oxides, and clay and mica contents [12–14]. However, sorption and desorption isotherms have
irregular profiles, presumably due to competition among metals for sorption sites, and the
dependence of sorption and desorption on soil properties is only moderately well represented by
linear models [15,16].
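The limited fit of linear models noted above can be illustrated with a minimal Python sketch. It uses invented data (not the study's measurements) in which the response jumps at a threshold of a single predictor, and compares a least-squares line with a one-split piecewise-constant fit whose two levels are the sample means on either side of the split. The names `fit_line`, `fit_stump`, `x` and `y` are hypothetical, and the single-split search is a deliberate simplification, not the CART procedure itself.

```python
# Synthetic illustration only: these numbers are invented, not taken from the study.

def mean(values):
    return sum(values) / len(values)

def sse(observed, predicted):
    """Sum of squared errors between observations and predictions."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (closed form)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def fit_stump(x, y):
    """Best single split x <= t; each side is predicted by its sample mean."""
    best = None
    xs = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct x values.
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        c_left, c_right = mean(left), mean(right)
        err = sse(left, [c_left] * len(left)) + sse(right, [c_right] * len(right))
        if best is None or err < best[0]:
            best = (err, t, c_left, c_right)
    _, t, c_left, c_right = best
    return t, c_left, c_right

# Hypothetical predictor (e.g. a soil property) and response (e.g. amount sorbed).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [10.0, 11.0, 9.0, 10.0, 30.0, 31.0, 29.0, 30.0]

a, b = fit_line(x, y)
t, c_left, c_right = fit_stump(x, y)

sse_line = sse(y, [a + b * xi for xi in x])
sse_stump = sse(y, [c_left if xi <= t else c_right for xi in x])

print(f"line:  y = {a:.2f} + {b:.2f} x, SSE = {sse_line:.2f}")
print(f"stump: split at x = {t}, levels = ({c_left}, {c_right}), SSE = {sse_stump:.2f}")
```

On data of this shape the piecewise-constant fit leaves a much smaller residual sum of squares than the line; with a smooth, nearly linear response the comparison would go the other way.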
A methodology that is gaining favour in an increasingly broad variety of fields for modelling
non-linear processes and structures is the use of decision trees, generalizations of the familiar
botanical key. When it is a regression model that is needed rather than a classifier, i.e., when
the dependent variable $Y$ is a continuous random variable with conditional distribution
$$ Y \mid x = f(x) + \varepsilon, \qquad x = (x_1, \ldots, x_n) $$
for some zero-mean random error $\varepsilon$ and the problem is to estimate the regression
function $f(x)$, these methods effectively divide the space $X = \{x\}$, in which the random
predictor variables $X_i$ take their values, into a finite number $m$ of disjoint
hyper-rectangles $D_k$ that together cover $X$, and approximate $f(x)$ by a piecewise constant
function
$$ \hat{f}(x) = \sum_{k=1}^{m} \hat{c}_k \mathbf{1}_k, \qquad (1) $$
where $\mathbf{1}_k$ is the indicator function of $D_k$ ($\mathbf{1}_k(x) = 1$ if $x \in D_k$,
$\mathbf{1}_k(x) = 0$ if $x \notin D_k$) and $\hat{c}_k$ is an estimate of the mean of $Y$ in
$D_k$ (in practice, the sample mean). The problem is to define the $D_k$. The regression tree
approach (decision-tree regression) does this in successive steps, creating a tree of nested
hyper-rectangles $D_k^{(i)}$ (the nodes of the tree), the lowest members of which (the "leaves")
are the final $D_k$. To avoid overfitting the model, this tree may then be
“pruned back”, a process analogous to backwards elimination of
0304-3894/$ – see front matter © 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.jhazmat.2009.01.016