Analytical Methods

Simultaneous data pre-processing and SVM classification model selection based on a parallel genetic algorithm applied to spectroscopic data of olive oils

Olivier Devos a,*, Gerard Downey b, Ludovic Duponchel a

a Laboratoire de Spectrochimie Infrarouge et Raman (LASIR CNRS UMR 8516), Université de Lille 1 Sciences et Technologies, Bât. C5, 59655 Villeneuve d’Ascq, France
b Teagasc Food Research Centre Ashtown, Ashtown, Dublin 15, Ireland

Article info
Article history: Received 18 February 2013; received in revised form 19 September 2013; accepted 2 October 2013; available online 14 October 2013
Keywords: Genetic algorithm; Spectral pre-processing; Parameter optimisation; Classification; Support vector machines; Infrared spectroscopy

Abstract
Classification is an important task in chemometrics. For several years now, support vector machines (SVMs) have proven to be powerful for infrared spectral data classification. However, such methods require optimisation of parameters in order to control the risk of overfitting and the complexity of the boundary. Furthermore, it is established that the prediction ability of classification models can be improved using pre-processing in order to remove unwanted variance in the spectra. In this paper we propose a new methodology based on a genetic algorithm (GA) for the simultaneous optimisation of SVM parameters and pre-processing (GENOPT-SVM). The method has been tested for the discrimination of the geographical origin of Italian olive oil (Ligurian and non-Ligurian) on the basis of near infrared (NIR) or mid infrared (FTIR) spectra. Different classification models (PLS-DA, SVM with mean-centred data, GENOPT-SVM) have been tested and statistically compared using McNemar’s statistical test. For the two datasets, SVM with optimised pre-processing gives models with higher accuracy than those obtained with PLS-DA on pre-processed data.
In the case of the NIR dataset, most of this accuracy improvement (86.3% compared with 82.8% for PLS-DA) occurred using only a single pre-processing step. For the FTIR dataset, three optimised pre-processing steps are required to obtain an SVM model with a significant accuracy improvement (82.2%) compared to the one obtained with PLS-DA (78.6%). Furthermore, this study demonstrates that even SVM models have to be developed on the basis of well-corrected spectral data in order to obtain higher classification rates.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Vibrational spectroscopy techniques (near infrared, mid-infrared and Raman) are rapid, non-destructive and generally non-invasive measurement methods. While food applications of Raman spectroscopy are now emerging, the infrared methods are already widely used in the agriculture, food and pharmaceutical industries for proximate analysis and quality control. In general, vibrational spectra contain compositional data which can be extracted using multivariate mathematical tools to yield quantitative information (using regression models, e.g. partial least squares regression, PLSR), qualitative information (using classification, e.g. partial least squares discriminant analysis, PLS-DA) or class-modelling solutions (e.g. soft independent modelling of class analogy, SIMCA). It is common practice to apply a data pre-processing step to raw spectral data prior to modelling because, in addition to chemical information, infrared spectra (especially NIR) contain random and systematic interferences from other sources (i.e. noise, stray light, light scatter, detector non-linearities, temperature variations, etc.) which have the potential to degrade model performance and therefore should be removed. Typical pre-processing methods include multiplicative scatter correction (MSC), standard normal variate (SNV), and 1st and 2nd derivatives (1Der and 2Der).
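The pre-processing methods named above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the synthetic spectra and the use of a plain finite difference for the derivative (rather than, for example, a smoothed derivative filter) are all assumptions made for illustration only.

```python
import numpy as np


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum.

    Each spectrum x is regressed on the reference r (x ~ a + b * r) and
    corrected as (x - a) / b. The mean spectrum is used if no reference is given.
    """
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        slope, intercept = np.polyfit(ref, x, 1)  # highest-degree coefficient first
        corrected[i] = (x - intercept) / slope
    return corrected


def first_derivative(spectra, spacing=1.0):
    """First derivative along the wavelength axis by finite differences (assumption)."""
    return np.gradient(np.asarray(spectra, dtype=float), spacing, axis=1)


# Synthetic demo (hypothetical data): one band distorted by offsets and gains.
wavelengths = np.linspace(0.0, 1.0, 50)
base = np.exp(-((wavelengths - 0.5) ** 2) / 0.02)
offsets = np.array([[0.0], [0.3], [-0.1]])
gains = np.array([[1.0], [1.5], [0.8]])
raw = offsets + gains * base

snv_spectra = snv(raw)
msc_spectra = msc(raw)
der_spectra = first_derivative(raw, spacing=wavelengths[1] - wavelengths[0])
```

In this synthetic case the distortions are purely additive and multiplicative, so both SNV and MSC map the three raw spectra onto a single curve; real spectra would generally retain some residual variation.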
Experience has shown that the use of one or more of these transformations can improve classification accuracy in the case of qualitative analysis and increase prediction accuracy of quantitative models.

With particular regard to qualitative analysis, many methods exist for sample classification based on spectroscopic data (Balabin, Safieva, & Lomakina, 2010; Berrueta, Alonso-Salces, & Heberger, 2007). Very often the final choice of a classification algorithm depends on the structure of the data being studied, but the final selection criterion remains first and foremost the prediction performance obtained with any model.

Food Chemistry 148 (2014) 124–130. http://dx.doi.org/10.1016/j.foodchem.2013.10.020
Corresponding author. Tel.: +33 320 434 748; fax: +33 320 436 755. E-mail address: olivier.devos@univ-lille1.fr (O. Devos).

Support vector machines (SVMs) belong to a new generation of learning algorithms used for classification and regression tasks (Cristianni