JOURNAL OF CHEMOMETRICS
J. Chemometrics 2004; 18: 486–497
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cem.893

Sequential application of backward interval partial least squares and genetic algorithms for the selection of relevant spectral regions

Riccardo Leardi 1* and Lars Nørgaard 2

1 Department of Pharmaceutical and Food Chemistry and Technology, University of Genoa, Genoa, Italy
2 Chemometrics Group, Department of Food Science, The Royal Veterinary and Agricultural University, Rolighedsvej 30, DK-1958 Frederiksberg, Denmark

Received 13 September 2004; Accepted 27 January 2005

It is nowadays widely accepted that genetic algorithms (GAs) are powerful tools in variable selection and that after suitable modifications they can also be powerful in detecting the most relevant spectral regions for multivariate calibration. One of the main limitations of GAs is related to the fact that when spectral intensities are measured at a very large number of wavelengths the search domain increases correspondingly and therefore the detection of the relevant regions is much more difficult. A modification of interval partial least squares (iPLS), designated backward interval PLS (biPLS), is developed and studied such that it can detect and remove the least relevant regions, thereby reducing the search domain to a size that GAs can handle easily. In this paper the application to two different spectroscopic data sets will be shown: infrared spectroscopic analysis of polymer film additives and determination of the contents of erucic acid and total fatty acids in brassica seeds by near-infrared spectroscopy. The developed method is compared with model performances based on expert selection of variables as well as with results from application of the previously developed GA-PLS method.
The sequential application of biPLS and GA-PLS has proven successful, and comparable or better results have been obtained, introducing a more automatic region selection procedure and a substantial decrease in computation time. Copyright © 2005 John Wiley & Sons, Ltd.

KEYWORDS: genetic algorithms; backward interval partial least squares; region selection; variable selection; spectroscopy; near-infrared; infrared

1. INTRODUCTION

The good performance of Genetic Algorithms Partial Least Squares (GA-PLS) as a tool for wavelength selection has been reported in previous papers [1–5]. However, study of the behaviour of GA-PLS in such problems has suggested a limit of no more than 200 variables, since it has been found empirically that a greater number of variables, i.e. a larger search domain, reduces the capability of obtaining a solution with good predictive ability.

Two different and apparently independent reasons could be at the root of this. The first is that the greater the number of variables in the X matrix, the greater is the probability of finding chance correlations and therefore of overfitting (this is why a limit of 5 on the variables/objects ratio was previously suggested). The second is the exponential growth of the search domain: with k variables, $2^k - 1$ combinations are possible.

When dealing with more than 200 wavelengths, the number of variables was previously reduced by applying windows of size n, so that each new variable was the average of the signal intensities at n consecutive wavelengths [1,2]. In the case of spectra with very narrow peaks this approach can be quite dangerous, since some spectral features can be smoothed too much and therefore lose their relevance.
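The windowing step and the size of the search domain can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy spectrum and the handling of an incomplete trailing window are assumptions made only for illustration.

```python
from typing import List, Sequence


def window_average(spectrum: Sequence[float], n: int) -> List[float]:
    """Average the intensities at n consecutive wavelengths into one new variable.

    Assumption: an incomplete trailing window is averaged over the points it
    contains; the source does not say how leftover wavelengths are treated.
    """
    if n < 1:
        raise ValueError("window size n must be at least 1")
    return [
        sum(spectrum[i:i + n]) / len(spectrum[i:i + n])
        for i in range(0, len(spectrum), n)
    ]


def search_domain_size(k: int) -> int:
    """Number of non-empty variable subsets for k variables: 2^k - 1."""
    return 2 ** k - 1


# Hypothetical spectrum: flat baseline with one narrow peak at index 5.
spectrum = [0.0] * 12
spectrum[5] = 1.0

reduced = window_average(spectrum, n=4)
print(reduced)                           # [0.0, 0.25, 0.0]
print(search_domain_size(12))            # 4095 subsets before windowing
print(search_domain_size(len(reduced)))  # 7 subsets after windowing
```

In this toy case a window of size 4 shrinks the search domain from 4095 to 7 combinations, but the narrow peak of height 1.0 is reduced to 0.25, which is the oversmoothing risk noted above.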
*Correspondence to: R. Leardi, Department of Pharmaceutical and Food Chemistry and Technology, University of Genoa, via Brigata Salerno (Ponte), I-16147 Genova, Italy. E-mail: riclea@dictfa.unige.it
Contract/grant sponsor: Centre for Advanced Food Studies (Major Research Infrastructure).

To avoid this problem, an iterative approach was followed in which the least relevant spectral regions (as defined by the frequency of selection by GA-PLS applied to the ‘windowed’ spectra) were successively removed and therefore the window size could be reduced. This strategy produced good results [3] but required a huge amount of time, since several sets of GA-PLS had to be run (e.g. in the cited paper five GA-PLS calculations were run). Furthermore, at each GA-PLS calculation the decisions about which regions to discard had