Explanatory multivariate analysis of ToF-SIMS spectra for the discrimination of bacterial isolates

Seetharaman Vaidyanathan,†*a John S. Fletcher,a Roger M. Jarvis,b Alex Henderson,a Nicholas P. Lockyer,a Royston Goodacre b and John C. Vickerman a

Received 15th April 2009, Accepted 26th August 2009
First published as an Advance Article on the web 14th September 2009
DOI: 10.1039/b907570d

Multivariate analysis (PC-CVA and GA-CVA) was carried out on time-of-flight secondary ion mass spectra (ToF-SIMS) derived from 16 bacterial isolates associated with urinary tract infections, with the objective of extracting the spectral information relevant to their species-level discrimination. The use of spectral pre-processing, such as removal of the dominant peaks prior to analysis and analysis of the dominant peaks alone, enabled the identification of 37 peaks contributing to the principal components-canonical variates analysis (PC-CVA) discrimination of the bacterial isolates in the mass range of m/z 1–1000. These included signals at m/z 70, 84, 120, 134, 140, 150, 175 and 200. A univariate statistical analysis (Kruskal–Wallis) of the signal intensities at the identified m/z values enabled an understanding of the discriminatory basis, which can be used in the development of robust parsimonious models for predictive purposes. The utility of genetic algorithm (GA)-based feature selection in identifying the discriminatory variables is also demonstrated. A database search of the identified signals enabled the biochemical origins of some of these signals to be postulated.

Introduction

Time-of-flight secondary ion mass spectrometry (ToF-SIMS) is a surface technique that yields spectral information useful in discerning chemical changes associated with the surface being analysed. 1 The advent of cluster ion sources, such as C₆₀⁺, 2 has enabled the application of the technique to derive molecular information from surfaces.
3 Of particular interest is the development of the technique to analyse biological surfaces. 4,5 The potential of the technique in discriminating bacterial isolates has been demonstrated in earlier investigations. 6,7 Although these investigations demonstrate the ability of the technique to generate spectral information for discriminatory purposes, the value of the technique will be strengthened by seeking explanatory analysis of the discriminating variables, both in terms of chemistry and the associated biology. In this investigation we have sought to explain the discrimination of 16 bacterial isolates associated with urinary tract infections by employing multivariate analysis of the ToF-SIMS spectral data and following it with univariate statistics and bioinformatics.

Multivariate analysis methods such as principal component analysis (PCA) and discriminant function analysis (DFA) have been shown to be useful in exploring the ToF-SIMS spectral information. 6–9 Deconvolution of the spectra with a view to understanding the basis of such analyses is a key challenge that will enable construction of robust models for predictive purposes. An earlier investigation discussed the application of PC-DFA to the discrimination of bacterial isolates associated with urinary tract infection (UTI), based on their ToF-SIMS spectra. 6 UTI, prevalent in adult women, is a considerable problem in general practice with high consultation rates, 10 and there is a growing need for rapid methods to screen for causal agent(s) prior to antibiotic treatment.

In this investigation, we report the application of PC-canonical variates analysis 11,12 (PC-CVA) and genetic algorithms (GAs) on the spectral dataset with a view to extracting the spectral information relevant to the discrimination at the species level.
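
To make the PC-CVA step concrete, the following is a minimal NumPy sketch of the general technique, not the authors' implementation. It assumes a data matrix `X` (spectra in rows, m/z bins in columns) and a vector of class labels `y`; the function name `pc_cva`, the number of retained principal components, the Cholesky-based solution of the eigenproblem and the synthetic demonstration data are all illustrative assumptions. The spectra are first compressed by PCA, and canonical variates are then sought in the PC scores space as the directions that maximise the ratio of between-group to within-group scatter.

```python
import numpy as np

def pc_cva(X, y, n_pcs):
    """PCA compression followed by canonical variates analysis (illustrative)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    groups = np.unique(y)

    # Step 1: PCA on mean-centred data via SVD; keep the first n_pcs components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_pcs].T              # PC loadings, shape (n_vars, n_pcs)
    T = Xc @ P                    # PC scores, shape (n_samples, n_pcs)

    # Step 2: within-group (W) and between-group (B) scatter of the PC scores.
    # The grand mean of T is zero because X was mean-centred.
    W = np.zeros((n_pcs, n_pcs))
    B = np.zeros((n_pcs, n_pcs))
    for g in groups:
        Tg = T[y == g]
        mg = Tg.mean(axis=0)
        W += (Tg - mg).T @ (Tg - mg)
        B += len(Tg) * np.outer(mg, mg)

    # Step 3: solve B v = lambda W v by whitening with the Cholesky factor of W.
    # This requires W to be positive definite (enough spectra per retained PC).
    L_inv = np.linalg.inv(np.linalg.cholesky(W))
    evals, evecs = np.linalg.eigh(L_inv @ B @ L_inv.T)
    order = np.argsort(evals)[::-1][: min(n_pcs, len(groups) - 1)]
    V = L_inv.T @ evecs[:, order]

    # CV scores, CV loadings on the original variables, and the eigenvalues.
    return T @ V, P @ V, evals[order]

# Synthetic demonstration data (an assumption, not the paper's spectra):
# 3 groups of 10 "spectra" with 50 variables; variables 3 and 7 carry the signal.
rng = np.random.default_rng(0)
y_demo = np.repeat(np.arange(3), 10)
X_demo = rng.normal(size=(30, 50))
X_demo[y_demo == 0, 3] += 10.0
X_demo[y_demo == 1, 7] += 10.0
scores, loadings, ratios = pc_cva(X_demo, y_demo, n_pcs=5)
```

In this sketch the second return value projects each canonical variate back onto the original variables, which is the quantity one would inspect as a loadings plot; the number of PCs retained is a tuning choice that the sketch leaves to the caller.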
Through the interpretation of spectral loadings plots, where high loadings for a factor indicate spectral components of particular importance to the discrimination, it is possible to discern the chemical basis of the classification model. 13 However, for complex data, such as mass spectra, where many independently measured variables are recorded, spectral loadings plots recovered from PC-CVA can be very difficult to interpret, and it is not always apparent which combinations of small numbers of variables have good discriminatory ability. Nevertheless, it is desirable to reduce the solution to a classification problem down to a handful of spectral variables so that simple, interpretable rules can be achieved. Therefore, in order to determine variable subset combinations that contribute most to the class separation observed with PC-CVA, GA feature selection coupled directly to CVA was examined. GAs belong to a group of evolutionary algorithms that have been shown to be useful in feature selection of spectroscopic data. 14–18 In a feature selection context, the GA is used to select small subsets of spectral variables to assess against a cost function; in this case we seek to maximise the between-group variance and minimise the within-group variance between a priori classes in CVA scores space.

a School of Chemical Engineering and Analytical Science, Manchester Interdisciplinary Biocentre, University of Manchester, 131 Princess Street, Manchester, UK M1 7ND. E-mail: S.Vaidyanathan@sheffield.ac.uk
b School of Chemistry, Manchester Interdisciplinary Biocentre, University of Manchester, 131 Princess Street, Manchester, UK M1 7ND
† Present address: ChELSI, Department of Chemical & Process Engineering, University of Sheffield, Sheffield, UK S1 3JD.

Analyst, 2009, 134, 2352–2360 | www.rsc.org/analyst | This journal is © The Royal Society of Chemistry 2009
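
The GA feature selection idea described in the Introduction can be sketched as follows. This is a hedged illustration rather than the procedure used in the paper: the fitness measure (the trace of W⁻¹B computed on the selected variables), the small ridge term, the function names `cva_fitness` and `ga_select`, and all GA settings (population size, tournament selection, union-based crossover, single-variable mutation, elitism) are assumptions chosen for brevity.

```python
import numpy as np

def cva_fitness(X, y, subset):
    """Between-group vs within-group scatter, trace(W^-1 B), for a variable subset."""
    Xs = X[:, subset]
    grand = Xs.mean(axis=0)
    k = Xs.shape[1]
    W = np.zeros((k, k))
    B = np.zeros((k, k))
    for g in np.unique(y):
        Xg = Xs[y == g]
        mg = Xg.mean(axis=0)
        W += (Xg - mg).T @ (Xg - mg)
        B += len(Xg) * np.outer(mg - grand, mg - grand)
    # Small ridge for numerical stability (an assumption of this sketch).
    W += 1e-8 * max(np.trace(W) / k, 1.0) * np.eye(k)
    return float(np.trace(np.linalg.solve(W, B)))

def ga_select(X, y, n_select=2, pop_size=30, n_gen=40, p_mut=0.2, seed=0):
    """Toy GA searching for a fixed-size variable subset with high cva_fitness."""
    rng = np.random.default_rng(seed)
    n_vars = X.shape[1]
    pop = [rng.choice(n_vars, n_select, replace=False) for _ in range(pop_size)]
    best, best_fit = None, -np.inf
    for _ in range(n_gen):
        fits = np.array([cva_fitness(X, y, ind) for ind in pop])
        i = int(fits.argmax())
        if fits[i] > best_fit:
            best, best_fit = np.sort(pop[i]), float(fits[i])
        new_pop = [pop[i].copy()]                     # elitism: keep the fittest
        while len(new_pop) < pop_size:
            parents = []
            for _ in range(2):                        # binary tournament selection
                a, b = rng.choice(pop_size, 2, replace=False)
                parents.append(pop[a] if fits[a] >= fits[b] else pop[b])
            # Crossover: draw the child from the union of the parents' variables.
            pool = np.union1d(parents[0], parents[1])
            child = rng.choice(pool, n_select, replace=False)
            # Mutation: replace one variable with a randomly chosen unused one.
            if rng.random() < p_mut:
                unused = np.setdiff1d(np.arange(n_vars), child)
                child[rng.integers(n_select)] = rng.choice(unused)
            new_pop.append(child)
        pop = new_pop
    return best, best_fit
```

A call such as `ga_select(X, y, n_select=3)` returns the indices of the best subset found and its fitness. Because a GA is stochastic, repeated runs with different seeds may return different subsets, so in practice one would look at which variables are selected consistently across runs.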