The successive projections algorithm for interval selection in PLS

Adriano de Araújo Gomes a, Roberto Kawakami Harrop Galvão b, Mário Cesar Ugulino de Araújo a, Germano Véras c,1, Edvan Cirino da Silva a,*

a Universidade Federal da Paraíba, CCEN, Departamento de Química, Caixa Postal 5093, CEP 58051-970, João Pessoa, PB, Brazil
b Instituto Tecnológico de Aeronáutica, Divisão de Engenharia Eletrônica, CEP 12228-900, São José dos Campos, SP, Brazil
c Universidade Estadual da Paraíba, CCT, Departamento de Química, 58.429-500, Campina Grande, PB, Brazil

Article info
Article history: Received 27 November 2012; received in revised form 4 March 2013; accepted 17 March 2013; available online 1 April 2013
Keywords: Variable selection; iPLS; Successive projections algorithm; Partial Least Squares; NIR spectrometry

Abstract
The successive projections algorithm (SPA) is aimed at selecting a subset of variables with small multi-collinearity and suitable prediction power for use in Multiple Linear Regression (MLR). The resulting SPA–MLR models have advantages in terms of simplicity and ease of interpretation as compared to latent-variable models, such as Partial-Least-Squares (PLS). However, PLS tends to be less sensitive to instrumental noise. The present paper proposes an extension of SPA to combine the noise-reduction properties of PLS with the possibility of discarding non-informative variables in SPA. For this purpose, SPA is modified in order to select intervals of variables for use in PLS. The proposed iSPA–PLS algorithm is evaluated in two case studies involving near-infrared spectrometric analysis of wheat and beer extract samples. As compared to full-spectrum PLS, the resulting iSPA–PLS models exhibited better performance in terms of both cross-validation and external prediction. On the other hand, iSPA–PLS and SPA–MLR presented similar cross-validation performance, but the iSPA–PLS models clearly outperformed SPA–MLR in the external prediction.
Such results indicate that iSPA–PLS may be more robust with respect to differences between the external prediction set and the calibration set used in the cross-validation procedure.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Modern analytical instruments can provide a large number of measured variables per analyzed sample within a short time. However, in many cases not all of these variables are of value for building a multivariate calibration model that relates the analytical signal to the parameter of interest. Within this scope, selection techniques can be used to find a suitable subset of informative variables and thus obtain simpler models without compromising the prediction ability [1–5]. Although this topic has been the subject of extensive investigations, it is still a matter of much research in the literature [6,7].

Variable selection can be regarded as a combinatorial optimization problem involving the minimization of a cost function related to the analytical goal. In this sense, a variable selection strategy can be characterized by the type of cost function, the constraints imposed on the combinations of variables, and the optimization algorithm itself [8]. Different options for these three features have been investigated in the literature, giving rise to several selection strategies [9–12]. However, this topic is still a matter of much research in Chemometrics and related fields.

Within the scope of multivariate calibration, Araújo and collaborators have proposed the Successive Projections Algorithm (SPA) for selection of variables in Multiple Linear Regression (MLR) modeling [13,14]. This algorithm is aimed at selecting a subset of variables with small multi-collinearity and suitable prediction power.
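The idea of building a subset with small multi-collinearity can be illustrated with a minimal sketch of the projection step, as SPA is usually described: starting from one variable, every remaining column is projected onto the subspace orthogonal to the last selected column, and the column with the largest remaining norm is chosen next. The function name `spa_projection_chain`, its arguments, and the pure-Python data layout are assumptions made for illustration and are not taken from the paper. The sketch covers only the projection phase, not the later choice of starting variable and subset size by prediction error.

```python
def spa_projection_chain(X, start, n_select):
    """Return the indices of n_select columns chosen by successive projections.

    X is a list of rows (samples) whose entries are variables; start is the
    index of the initial column. All names here are illustrative assumptions.
    """
    n_rows, n_cols = len(X), len(X[0])
    # Copy the data column-wise so that X itself is never modified.
    cols = [[X[i][j] for i in range(n_rows)] for j in range(n_cols)]
    selected = [start]
    for _ in range(n_select - 1):
        ref = cols[selected[-1]]
        ref_norm_sq = sum(v * v for v in ref)
        if ref_norm_sq == 0.0:
            raise ValueError("cannot project onto a zero column")
        best, best_norm_sq = None, -1.0
        for j in range(n_cols):
            if j in selected:
                continue
            # Remove from column j its component along the last selected column.
            coef = sum(a * b for a, b in zip(cols[j], ref)) / ref_norm_sq
            cols[j] = [a - coef * b for a, b in zip(cols[j], ref)]
            norm_sq = sum(v * v for v in cols[j])
            # Keep the column that retains the most information after projection.
            if norm_sq > best_norm_sq:
                best, best_norm_sq = j, norm_sq
        selected.append(best)
    return selected
```

In a toy matrix whose second column is nearly a multiple of the first, the chain started at the first column skips the redundant second column and picks an unrelated column instead, which is the behaviour the text attributes to SPA.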
The good results obtained by SPA–MLR in different analytical problems [15] motivated the extension of the algorithm to other fields of Chemometrics, such as calibration transfer [16], classification problems [17–19] and sample selection [20].

Microchemical Journal 110 (2013) 202–208. ISSN 0026-265X. Paper presented at the 5th Ibero-American Congress of Analytical Chemistry 2012. * Corresponding author. Tel.: +55 83 3216 7438; fax: +55 83 3216 7437. E-mail address: edvan@quimica.ufpb.br (E.C. da Silva). 1 Tel.: +55 83 3315 3356. http://dx.doi.org/10.1016/j.microc.2013.03.015

As compared to multivariate calibration methods based on latent variables, such as Partial-Least-Squares (PLS), SPA–MLR models have advantages in terms of simplicity and ease of interpretation. Moreover, in some reported cases, SPA–MLR provided better prediction results compared to PLS, which may be ascribed to the removal of uninformative variables from the modeling process [21–23]. However, PLS models tend to be less sensitive to instrumental noise because of the averaging process involved in the calculation of latent variables from several redundant variables in the original domain