Analytica Chimica Acta 733 (2012) 16–22
journal homepage: www.elsevier.com/locate/aca

Systematic ratio normalization of gas chromatography signals for biological sample discrimination and biomarker discovery

Benoist Lehallier a, Jérémy Ratel a, Mohamed Hanafi b, Erwan Engel a
a INRA, UR 370 QuaPA, MASS laboratory, 63122 Saint-Genès-Champanelle, France
b ONIRIS, Sensometrics & Chemometrics Laboratory, site de la Géraudière, 44322 Nantes, France

Article info
Article history: Received 2 September 2011; Received in revised form 3 April 2012; Accepted 10 April 2012; Available online 25 April 2012
Keywords: Systematic ratio normalization; Discrimination; Biomarker; Gas chromatography–mass spectrometry; Volatile compounds

Abstract
The present paper introduces a new gas chromatography data processing procedure, dubbed systematic ratio normalization (SRN), that improves both sample set discrimination and biomarker identification. SRN consists of (1) calculating, for each sample, all the log-ratios between abundances of chromatography-analyzed compounds, then (2) selecting the log-ratio(s) that maximize the discrimination between sample sets. The relevance of SRN was evaluated on two data sets acquired through gas chromatography–mass spectrometry as part of separate studies designed (i) to discriminate vegetable oils according to source origin, analyzed on an analytical system exposed to instrument drift (data set 1), and (ii) to discriminate meat samples aged for different durations according to animal feed (data set 2). Applying SRN to raw data made it possible to obtain robust discrimination models for the two data sets by enhancing the contribution of the factor-of-interest to the data variance while stabilizing the contribution of the disturbance factor.
The most discriminant log-ratios were shown to employ the most relevant biomarkers presenting relative independence of the factor-of-interest as well as co-behavior of the disturbance effects potentially biasing the discrimination, such as instrument drift or sample biochemical changes. SRN can be run a posteriori on any data set, and might be generalizable to most separation methods.
© 2012 Elsevier B.V. All rights reserved.

Corresponding author. Tel.: +33 04 73 62 45 89; fax: +33 04 73 62 47 31. E-mail address: erwan.engel@clermont.inra.fr (E. Engel).

1. Introduction

Following the extensive work being done in the areas of metabolomics and proteomics [1,2], the discrimination of biological samples based on mass spectrometry has become of interest in its own right [3]. Differentiating sample sets according to a factor-of-interest hinges on highlighting distinctive components that may only be present in trace amounts, while minimizing the incidence of other factors liable to mask the discriminant function, even partially. Gas chromatography coupled with mass spectrometry (GC–MS) is well suited to handling the discrimination of complex matrices such as processed foods or biological samples, both in terms of technical accuracy and quantification of small-molecular-weight compounds [4–6]. Together with peak alignment, mass spectra deconvolution and compound identification [7], signal normalization represents one key bottleneck to a comprehensive discovery of distinctive biomarkers. Despite the increasingly powerful performance of commercially available instruments, extracting useful information from analytical signals still requires chemometric normalization tools in order to minimize the incidence of disturbance factors, some of which are tied to the technique employed while others are inherent to the sample itself [6,8].
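As a rough illustration of the two SRN steps summarized in the abstract, the sketch below computes all pairwise log-ratios for each sample and then ranks them by how well they separate the sample sets. It is not the authors' implementation. The function names, the toy abundances and the use of a one-way ANOVA F statistic as the discrimination criterion are all assumptions made for this example. In the toy data, compound 0 carries the group effect by construction and every sample is scaled by a multiplicative drift factor.

```python
import math
from itertools import combinations


def all_log_ratios(abundances):
    """Step 1: every log-ratio ln(a_i / a_j), i < j, for one sample."""
    n = len(abundances)
    return {(i, j): math.log(abundances[i] / abundances[j])
            for i, j in combinations(range(n), 2)}


def f_statistic(values, labels):
    """One-way ANOVA F statistic (assumed criterion, not taken from the paper)."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = sum(values) / len(values)
    ss_between = sum(len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
                     for vs in groups.values())
    ss_within = sum((v - sum(vs) / len(vs)) ** 2
                    for vs in groups.values() for v in vs)
    if ss_within == 0:
        return math.inf  # groups perfectly separated, no within-group spread
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def rank_log_ratios(samples, labels):
    """Step 2: score each log-ratio across samples, most discriminant first."""
    per_sample = [all_log_ratios(s) for s in samples]
    scores = {pair: f_statistic([r[pair] for r in per_sample], labels)
              for pair in per_sample[0]}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical abundances of three compounds in six samples (two sample sets).
# Rows within a set differ mainly by a drift factor of about 1.0, 1.5 and 2.0.
samples = [
    [10.0, 5.1, 2.0], [15.3, 7.5, 3.1], [19.8, 10.0, 3.9],   # set A
    [20.2, 5.0, 2.1], [29.7, 7.6, 3.0], [40.4, 9.9, 4.0],    # set B
]
labels = ["A", "A", "A", "B", "B", "B"]

ranking = rank_log_ratios(samples, labels)
best_pair, best_score = ranking[0]
raw_score = f_statistic([s[0] for s in samples], labels)  # compound 0, unnormalized
```

On this toy data the top-ranked log-ratio involves compound 0 and scores far higher than the raw abundance of compound 0, because the drift factor cancels in the ratio while the group effect remains. The number of candidate ratios grows quadratically with the number of compounds, so a real data set would need a more careful selection and validation step than this ranking.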
Normalization is generally defined as a processing procedure designed to suppress systematic variance that is unrelated to the relevant signal [9,10]. Among the data normalization methods available, variable ratios became widely adopted in the twentieth century [11]. A number of authors have proposed normalizing each compound in the GC–MS signal by one or more compounds found naturally in every chromatogram and presenting a variance that is independent of the factor studied [12]. This method, called diagnostic ratios, does however require a priori selection of reference compounds and affects covariance between normalized variables [9]. Internal signal normalization by the sum of the signal components is commonly performed in data analysis to overcome the effects of variations in sensitivity and injected quantities on the intensity of recorded signals [13,14]. However, this procedure can prove insufficient, as the individual normalized variables have a high covariance due to the way the normalized values are expressed (as a percentage of the sum total), which generates statistical cross-links [15]. Moreover, this procedure may distort the data severely if the assumption of a lack of overall

0003-2670/$ see front matter. http://dx.doi.org/10.1016/j.aca.2012.04.019