Application of genetic algorithm–PLS for feature selection in spectral data sets

Riccardo Leardi*

Department of Pharmaceutical and Food Chemistry and Technology, University of Genova, Via Brigata Salerno (Ponte), I-16147 Genova, Italy

SUMMARY

After suitable modifications, genetic algorithms can be a useful tool for wavelength selection in multivariate calibration performed by PLS. Unlike the majority of feature selection methods applied to spectral data, the algorithm often selects variables that correspond to well-defined and characteristic spectral regions rather than single variables scattered throughout the spectrum. This leads to a model having a better predictive ability than the full-spectrum model; furthermore, the analysis of the selected regions can be a valuable help in understanding which parts of the spectra are relevant. After the presentation of the algorithm, several real cases are shown. Copyright 2000 John Wiley & Sons, Ltd.

KEY WORDS: genetic algorithms; feature selection; PLS regression; spectral data

1. INTRODUCTION

Nowadays, spectral data are perhaps the most common type of data to which chemometric techniques are applied. Owing to the development of new instrumentation, data sets in which each object is described by several hundreds of variables can be easily obtained. Methods such as partial least squares (PLS) or principal component regression (PCR), being based on latent variables, allow one to take into account the whole spectrum without having to perform a previous feature selection [1,2]. Owing to their capability of extracting the relevant part of the information and of producing reliable models, these full-spectrum methods were until fairly recently considered almost insensitive to noise, and it was therefore commonly stated that no feature selection at all was required [2].
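
As a minimal illustration of the full-spectrum approach described above, the following sketch fits a PLS1 model by the NIPALS algorithm on simulated spectra with many more variables (100) than objects (20). It is not code from the paper: the simulated two-component mixture, the noise level and all function names (`pls1_fit`, `pls1_predict`, `simulate`, `rmse`) are assumptions made for illustration, and the code is kept dependency-free rather than optimized.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def pls1_fit(X, y, n_components):
    """Fit a PLS1 model by NIPALS. X is a list of rows, y a list of responses."""
    n, p = len(X), len(X[0])
    x_mean = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred X residual
    f = [v - y_mean for v in y]                                # centred y residual
    components = []
    for _ in range(n_components):
        # Weights: proportional to the covariance of each variable with the y residual.
        w = [dot(col, f) for col in zip(*E)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            break  # nothing left to model
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]                   # scores (latent variable)
        tt = dot(t, t)
        load = [dot(col, t) / tt for col in zip(*E)]     # X loadings
        q = dot(f, t) / tt                               # y loading
        # Deflate X and y before extracting the next latent variable.
        E = [[row[j] - t[i] * load[j] for j in range(p)] for i, row in enumerate(E)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, load, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict the response for one spectrum x by sequential deflation."""
    x_mean, y_mean, components = model
    e = [a - m for a, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in components:
        t = dot(e, w)
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, load)]
    return y_hat


def simulate(n, rng, noise=0.0, p=100):
    """Hypothetical mixture spectra: two Gaussian bands; y is the first concentration."""
    s1 = [math.exp(-((j - 30) / 8.0) ** 2) for j in range(p)]
    s2 = [math.exp(-((j - 60) / 12.0) ** 2) for j in range(p)]
    X, y = [], []
    for _ in range(n):
        c1, c2 = rng.random(), rng.random()
        X.append([c1 * a + c2 * b + rng.gauss(0.0, noise) for a, b in zip(s1, s2)])
        y.append(c1)
    return X, y


def rmse(model, X, y):
    """Root mean square error of prediction over a set of spectra."""
    return math.sqrt(sum((pls1_predict(model, x) - v) ** 2 for x, v in zip(X, y)) / len(y))


rng = random.Random(0)
X_train, y_train = simulate(20, rng, noise=0.01)
X_test, y_test = simulate(10, rng, noise=0.01)
model = pls1_fit(X_train, y_train, n_components=2)
print("RMSEP, full spectrum (100 variables, 20 objects):", rmse(model, X_test, y_test))
```

Because the model is built on two latent variables rather than on the 100 original variables, no prior selection is needed to obtain a solution. Restricting the columns of `X` before calling `pls1_fit` would give a reduced model of the kind that feature selection aims at; which columns to keep is the question the rest of the paper addresses.
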
In the last few years it has instead been recognized that an efficient feature selection can be highly beneficial, both to improve the predictive ability of the model and to greatly reduce its complexity [3]. Over the same period, several techniques devoted to feature selection in PLS models applied to spectral data have been presented. Three of these methods are iterative variable selection (IVS) [4], uninformative variable elimination (UVE) [5] and iterative predictor weighting (IPW) [6].

JOURNAL OF CHEMOMETRICS
J. Chemometrics 2000; 14: 643–655

* Correspondence to: R. Leardi, Department of Pharmaceutical and Food Chemistry and Technology, University of Genova, Via Brigata Salerno (Ponte), I-16147 Genova, Italy. E-mail: riclea@dictfa.unige.it
Contract/grant sponsor: Italian Ministry of University and Scientific Research.
Contract/grant sponsor: CNR (Italian National Research Council), Comitato Scienze e Tecnologia Informazione.

Received 13 September 1999
Accepted 20 March 2000