The successive projections algorithm for interval selection in PLS ☆
Adriano de Araújo Gomes a, Roberto Kawakami Harrop Galvão b, Mário Cesar Ugulino de Araújo a, Germano Véras c,1, Edvan Cirino da Silva a,⁎
a Universidade Federal da Paraíba, CCEN, Departamento de Química, Caixa Postal 5093, CEP 58051-970, João Pessoa, PB, Brazil
b Instituto Tecnológico de Aeronáutica, Divisão de Engenharia Eletrônica, CEP 12228-900, São José dos Campos, SP, Brazil
c Universidade Estadual da Paraíba, CCT, Departamento de Química, 58.429-500, Campina Grande, PB, Brazil
Article info
Article history:
Received 27 November 2012
Received in revised form 4 March 2013
Accepted 17 March 2013
Available online 1 April 2013
Keywords:
Variable selection
iPLS
Successive projections algorithm
Partial Least Squares
NIR spectrometry
Abstract
The successive projections algorithm (SPA) is aimed at selecting a subset of variables with small multi-collinearity and suitable prediction power for use in Multiple Linear Regression (MLR). The resulting SPA–MLR models have advantages in terms of simplicity and ease of interpretation as compared to latent-variable models, such as Partial-Least-Squares (PLS). However, PLS tends to be less sensitive to instrumental noise. The present paper proposes an extension of SPA to combine the noise-reduction properties of PLS with the possibility of discarding non-informative variables in SPA. For this purpose, SPA is modified in order to select intervals of variables for use in PLS. The proposed iSPA–PLS algorithm is evaluated in two case studies involving near-infrared spectrometric analysis of wheat and beer extract samples. As compared to full-spectrum PLS, the resulting iSPA–PLS models exhibited better performance in terms of both cross-validation and external prediction. On the other hand, iSPA–PLS and SPA–MLR presented similar cross-validation performance, but the iSPA–PLS models clearly outperformed SPA–MLR in the external prediction. Such results indicate that iSPA–PLS may be more robust with respect to differences between the external prediction set and the calibration set used in the cross-validation procedure.
© 2013 Elsevier B.V. All rights reserved.
1. Introduction
Modern analytical instruments can provide a large number of measured variables per analyzed sample within a short time. However, in many cases not all of these variables are useful for building a multivariate calibration model that relates the analytical signal to the parameter of interest. Within this scope, selection techniques can be used to find a suitable subset of informative variables and thus obtain simpler models without compromising the prediction ability [1–5]. Although this topic has been investigated extensively, it remains an active area of research [6,7].
Variable selection can be regarded as a combinatorial optimization problem involving the minimization of a cost function related to the analytical goal. In this sense, a variable selection strategy can be characterized by the type of cost function, the constraints imposed on the combinations of variables, and the optimization algorithm itself [8]. Different options for these three features have been investigated in the literature, giving rise to several selection strategies [9–12]. Nevertheless, the problem remains open in Chemometrics and related fields.
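The three ingredients named above (cost function, constraints, optimization algorithm) can be illustrated with a minimal sketch. It is not taken from the paper: the cost function is assumed to be the root-mean-square error on a separate validation set for an MLR model, the constraint is a cap on the subset size, and the optimizer is a plain exhaustive search. The names `rmse_validation` and `exhaustive_selection` are hypothetical.

```python
from itertools import combinations

import numpy as np


def rmse_validation(Xcal, ycal, Xval, yval, subset):
    """Cost function (assumed): validation RMSE of an MLR model on `subset`."""
    cols = list(subset)
    # Least-squares fit with an intercept, using only the chosen columns.
    A = np.column_stack([np.ones(len(Xcal)), Xcal[:, cols]])
    coef, *_ = np.linalg.lstsq(A, ycal, rcond=None)
    B = np.column_stack([np.ones(len(Xval)), Xval[:, cols]])
    resid = yval - B @ coef
    return float(np.sqrt(np.mean(resid ** 2)))


def exhaustive_selection(Xcal, ycal, Xval, yval, max_size):
    """Optimizer (assumed): try every subset with at most `max_size` variables."""
    best_cost, best_subset = np.inf, ()
    n_vars = Xcal.shape[1]
    for size in range(1, max_size + 1):  # constraint: bounded subset size
        for subset in combinations(range(n_vars), size):
            cost = rmse_validation(Xcal, ycal, Xval, yval, subset)
            if cost < best_cost:
                best_cost, best_subset = cost, subset
    return best_cost, best_subset
```

The number of candidate subsets grows combinatorially with the number of variables, so an exhaustive search of this kind is only feasible for very small problems; this is one reason why heuristic search strategies are used for spectral data.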
Within the scope of multivariate calibration, Araújo and collaborators have proposed the Successive Projections Algorithm (SPA) for selection of variables in Multiple Linear Regression (MLR) modeling [13,14]. This algorithm is aimed at selecting a subset of variables with small multi-collinearity and suitable prediction power. The good results obtained by SPA–MLR in different analytical problems [15] motivated the extension of the algorithm to other fields of Chemometrics, such as calibration transfer [16], classification problems [17–19] and sample selection [20].
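How a projection-based selection can keep multi-collinearity small may be easier to see in code. The sketch below is an assumption-laden illustration of the projection phase of SPA as it is commonly described in the cited literature, not the implementation used in this paper: starting from a given column, each step projects all columns onto the subspace orthogonal to the last selected one and picks the column with the largest remaining norm. The function name `spa_chain` and its arguments are hypothetical, and the later phase that evaluates prediction power is omitted.

```python
import numpy as np


def spa_chain(X, k0, m):
    """Build one chain of m column indices, starting from column k0.

    X  : (n_samples, n_variables) calibration matrix (assumed mean-centered)
    k0 : index of the starting variable
    m  : number of variables in the chain (assumed small enough that the
         projected reference column never becomes a zero vector)
    """
    Xp = np.asarray(X, dtype=float).copy()
    selected = [k0]
    for _ in range(m - 1):
        v = Xp[:, selected[-1]]
        # Remove from every column its component along v, i.e. project onto
        # the subspace orthogonal to the last selected (projected) column.
        Xp = Xp - np.outer(v, v @ Xp) / (v @ v)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0  # never pick an already selected column
        # The column with the largest projected norm carries the most
        # information not already explained by the selected columns.
        selected.append(int(np.argmax(norms)))
    return selected
```

In this sketch a column that is an exact multiple of an already selected one is projected to zero and is therefore never chosen, which is the mechanism that limits collinearity within the chain.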
As compared to multivariate calibration methods based on latent variables, such as Partial-Least-Squares (PLS), SPA–MLR models have advantages in terms of simplicity and ease of interpretation. Moreover, in some reported cases, SPA–MLR provided better prediction results than PLS, which may be ascribed to the removal of uninformative variables from the modeling process [21–23]. However, PLS models tend to be less sensitive to instrumental noise because of the averaging process involved in the calculation of latent variables from several redundant variables in the original domain
Microchemical Journal 110 (2013) 202–208
☆ Paper presented at 5th Ibero-American Congress of Analytical Chemistry 2012.
⁎ Corresponding author. Tel.: +55 83 3216 7438; fax: +55 83 3216 7437.
E-mail address: edvan@quimica.ufpb.br (E.C. da Silva).
1 Tel.: +55 83 3315 3356.
0026-265X/$ – see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.microc.2013.03.015