Analytical Methods

Simultaneous data pre-processing and SVM classification model selection based on a parallel genetic algorithm applied to spectroscopic data of olive oils

Olivier Devos a,⇑, Gerard Downey b, Ludovic Duponchel a

a Laboratoire de Spectrochimie Infrarouge et Raman (LASIR CNRS UMR 8516), Université de Lille 1 Sciences et Technologies, Bât. C5, 59655 Villeneuve d’Ascq, France
b Teagasc Food Research Centre Ashtown, Ashtown, Dublin 15, Ireland

article info

Article history:
Received 18 February 2013
Received in revised form 19 September 2013
Accepted 2 October 2013
Available online 14 October 2013

Keywords:
Genetic algorithm
Spectral pre-processing
Parameter optimisation
Classification
Support vector machines
Infrared spectroscopy

abstract

Classification is an important task in chemometrics. For several years now, support vector machines (SVMs) have proven to be powerful for infrared spectral data classification. However, such methods require optimisation of parameters in order to control the risk of overfitting and the complexity of the boundary. Furthermore, it is established that the prediction ability of classification models can be improved using pre-processing in order to remove unwanted variance in the spectra. In this paper we propose a new methodology based on a genetic algorithm (GA) for the simultaneous optimisation of SVM parameters and pre-processing (GENOPT-SVM). The method has been tested for the discrimination of the geographical origin of Italian olive oil (Ligurian and non-Ligurian) on the basis of near infrared (NIR) or mid infrared (FTIR) spectra. Different classification models (PLS-DA, SVM with mean-centred data, GENOPT-SVM) have been tested and statistically compared using McNemar’s statistical test. For the two datasets, SVM with optimised pre-processing gives models with higher accuracy than those obtained with PLS-DA on pre-processed data.
In the case of the NIR dataset, most of this accuracy improvement (86.3% compared with 82.8% for PLS-DA) occurred using only a single pre-processing step. For the FTIR dataset, three optimised pre-processing steps are required to obtain an SVM model with a significant accuracy improvement (82.2%) compared to the one obtained with PLS-DA (78.6%). Furthermore, this study demonstrates that even SVM models have to be developed on the basis of well-corrected spectral data in order to obtain higher classification rates.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Vibrational spectroscopy techniques (near infrared, mid-infrared and Raman) are rapid, non-destructive and generally non-invasive measurement methods. While food applications of Raman spectroscopy are now emerging, the infrared methods are already widely used in the agriculture, food and pharmaceutical industries for proximate analysis and quality control. In general, vibrational spectra contain compositional data which can be extracted using multivariate mathematical tools to yield quantitative information (using regression models, e.g. partial least squares regression, PLSR), qualitative information (using classification, e.g. partial least squares discriminant analysis, PLS-DA) or class-modelling solutions (e.g. soft independent modelling of class analogy, SIMCA). It is common practice to apply a data pre-processing step to raw spectral data prior to modelling; this is because, in addition to chemical information, infrared spectra (especially NIR) contain random and systematic interferences from other sources (i.e. noise, stray light, light scatter, detector non-linearities, temperature variations, etc.) which have the potential to degrade model performance and therefore should be removed. Typical pre-processing methods include multiplicative scatter correction (MSC), standard normal variate (SNV), and 1st and 2nd derivatives (1Der and 2Der).
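As an illustration of the pre-processing methods just listed, the following is a minimal Python sketch, not the implementation used in this work. The function names (`snv`, `msc`, `first_derivative`, `second_derivative`) are hypothetical and chosen for this example only. The derivatives are computed by plain central differences with unit spacing, which is a simplification: derivative pre-processing of spectra is commonly combined with smoothing (for example, Savitzky-Golay filtering), and the text above does not specify which variant is used.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: centre and scale ONE spectrum
    by its own mean and (sample) standard deviation."""
    m = mean(spectrum)
    s = stdev(spectrum)  # n - 1 denominator; needs >= 2 points
    return [(x - m) / s for x in spectrum]


def msc(spectra, reference=None):
    """Multiplicative scatter correction: regress each spectrum on a
    reference spectrum (default: the mean spectrum of the set), then
    remove the fitted offset and divide by the fitted slope."""
    if reference is None:
        # Column-wise mean over all spectra (assumption: common choice).
        reference = [mean(col) for col in zip(*spectra)]
    ref_mean = mean(reference)
    sxx = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for spec in spectra:
        spec_mean = mean(spec)
        # Ordinary least-squares fit: spec ~ offset + slope * reference
        sxy = sum((r - ref_mean) * (x - spec_mean)
                  for r, x in zip(reference, spec))
        slope = sxy / sxx
        offset = spec_mean - slope * ref_mean
        corrected.append([(x - offset) / slope for x in spec])
    return corrected


def first_derivative(spectrum):
    """1st derivative by central differences (unit spacing assumed);
    the two end points are dropped."""
    return [(spectrum[i + 1] - spectrum[i - 1]) / 2
            for i in range(1, len(spectrum) - 1)]


def second_derivative(spectrum):
    """2nd derivative by central differences (unit spacing assumed);
    the two end points are dropped."""
    return [spectrum[i + 1] - 2 * spectrum[i] + spectrum[i - 1]
            for i in range(1, len(spectrum) - 1)]


if __name__ == "__main__":
    # Synthetic toy data: the second "spectrum" is the first one with an
    # additive offset and a multiplicative factor, mimicking scatter.
    base = [1.0, 2.0, 4.0, 3.0, 5.0]
    scattered = [2.0 * x + 0.5 for x in base]
    print(snv(base))
    print(snv(scattered))
    print(msc([base, scattered]))
```

In this toy case the two spectra differ only by an offset and a positive scale factor, so SNV and MSC should each map them onto the same corrected curve. Real spectra also carry chemical differences, which these corrections are intended to preserve.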
Experience has shown that the use of one or more of these transformations can improve classification accuracy in the case of qualitative analysis and increase prediction accuracy of quantitative models.

With particular regard to qualitative analysis, many methods exist for sample classification based on spectroscopic data (Balabin, Safieva, & Lomakina, 2010; Berrueta, Alonso-Salces, & Heberger, 2007). Very often the final choice of a classification algorithm depends on the structure of data being studied but the final selection criterion remains first and foremost the prediction performance obtained with any model.

⇑ Corresponding author. Tel.: +33 320 434 748; fax: +33 320 436 755. E-mail address: olivier.devos@univ-lille1.fr (O. Devos).
Food Chemistry 148 (2014) 124–130
http://dx.doi.org/10.1016/j.foodchem.2013.10.020

Support vector machines (SVMs) belong to a new generation of learning algorithms used for classification and regression tasks (Cristianni