Biomarker discovery in mass spectral profiles by means of selectivity ratio plot

Tarja Rajalahti a,b, Reidar Arneberg c, Frode S. Berven d, Kjell-Morten Myhr a,b,e, Rune J. Ulvik d,f, Olav M. Kvalheim g

a Department of Clinical Medicine, University of Bergen, Bergen, Norway
b Department of Neurology, Haukeland University Hospital, Bergen, Norway
c Pattern Recognition Systems AS, Bergen, Norway
d Institute of Medicine, University of Bergen, Bergen, Norway
e The Norwegian Multiple Sclerosis National Competence Centre, Haukeland University Hospital, Bergen, Norway
f Laboratory of Clinical Biochemistry, Haukeland University Hospital, Bergen, Norway
g Department of Chemistry, University of Bergen, N-5007 Bergen, Norway

Article info

Article history:
Received 7 April 2008
Received in revised form 15 July 2008
Accepted 6 August 2008
Available online 22 August 2008

Keywords: Biomarkers; Variable selection; Target projection; Discriminant analysis; Cerebrospinal fluid

Abstract

This work presents a new method for variable selection in complex spectral profiles. The method is validated by comparing samples from cerebrospinal fluid (CSF) with the same samples spiked with peptide and protein standards at different concentration levels. Partial least squares discriminant analysis (PLS-DA) attempts to separate two groups of samples by regressing on a y-vector consisting of zeros and ones in the PLS decomposition. In most cases, several PLS components are needed to optimize the discrimination between groups. This creates difficulties for the interpretation of the model. By using the y-vector as a target, it is possible to transform the PLS components to obtain a single predictive target-projected component analogously to the predictive component in orthogonal partial least squares discriminant analysis (OPLS-DA).
By calculating the ratio between explained and residual variance of the spectral variables on the target-projected component, a selectivity ratio plot is obtained that can be used for variable selection. Applied to whole mass spectral profiles of pure and spiked CSF, the method detects peptides in the low molecular mass range (740–9000 Da) at least down to the 400 pM level without severe problems with false biomarker candidates. Similarly, we detect added proteins at least down to the 2 nM level in the medium mass range (6000–17,500 Da). Target projection represents the optimal way to fit a latent variable decomposition to a known target, but the selectivity ratio plot can be used for OPLS as well as other methods that produce a single predictive component. Comparison with some commonly used tools for variable selection shows that the selectivity ratio plot has the best performance. This observation is attributed to the fact that target projection utilizes both the predictive ability (regression coefficients) and the explanatory ability (spectral variance/covariance matrix) for the calculation of the selectivity ratio.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Mass spectral characterization of body fluids such as urine [1], blood [2–6], cerebrospinal fluid [7] and sweat [8] followed by multivariate analysis represents a useful approach to reveal biomarkers in metabolomic and proteomic research. With the development of matrix assisted laser desorption/ionization time-of-flight (MALDI-TOF) [9] and other methods for mass spectrometry, fractions with hundreds and even thousands of proteins have become accessible to chemical profiling at the molecular level. A common objective for such profiling is to reveal differences in composition between fractions of body fluids from healthy controls and persons with a specific disease [3,5,10].
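As a rough illustration of the selectivity ratio summarized in the abstract, the sketch below projects centered data onto the normalized PLS regression vector and takes, for each variable, the ratio of explained to residual variance on that single target-projected component. It is not the authors' code: the synthetic data, the function names `pls1_coefficients` and `selectivity_ratio`, the choice of two PLS components and the size of the simulated spike are all assumptions made for the example.

```python
import numpy as np


def pls1_coefficients(X, y, n_components):
    """Regression vector b of a PLS1 model (NIPALS); X and y must be centered."""
    Xk, yk = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                 # weights: covariance of each variable with y
        w /= np.linalg.norm(w)
        t = Xk @ w                    # scores
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        c = (yk @ t) / tt             # y loading
        Xk = Xk - np.outer(t, p)      # deflate X
        yk = yk - c * t               # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)


def selectivity_ratio(X, b):
    """Explained / residual variance per variable on the target-projected component."""
    w_tp = b / np.linalg.norm(b)          # target-projection weights
    t_tp = X @ w_tp                       # target-projected scores
    p_tp = X.T @ t_tp / (t_tp @ t_tp)     # target-projected loadings
    X_tp = np.outer(t_tp, p_tp)           # part of X explained by the component
    residual = X - X_tp
    return np.sum(X_tp ** 2, axis=0) / np.sum(residual ** 2, axis=0)


# Synthetic stand-in for spectral profiles (an assumption, not the CSF data):
# 20 "control" and 20 "spiked" samples, 50 variables, 2 of which carry the spike.
rng = np.random.default_rng(0)
n_per_group, n_vars = 20, 50
spiked = [5, 20]
X = rng.normal(size=(2 * n_per_group, n_vars))
X[n_per_group:, spiked] += 5.0
y = np.repeat([0.0, 1.0], n_per_group)    # y-vector of zeros and ones

Xc = X - X.mean(axis=0)
yc = y - y.mean()

b = pls1_coefficients(Xc, yc, n_components=2)
sr = selectivity_ratio(Xc, b)
top2 = set(np.argsort(sr)[-2:].tolist())
print("variables with the highest selectivity ratio:", sorted(top2))
```

Variables with a high ratio are those whose variance is mostly accounted for by the single predictive component; a selectivity ratio plot would draw these values against m/z.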
Components that differ in absolute or relative amounts between the controls and patients may have diagnostic value and at the same time provide clues about the pathogenesis of a disease. The ratio of the number of spectral variables to the number of available samples may be larger than 1000 for such studies. This makes it difficult to avoid a "forest" of false biomarker candidates.

Chemometrics and Intelligent Laboratory Systems 95 (2009) 35–48
Corresponding author: O.M. Kvalheim. Tel.: +47 55583366; fax: +47 55589490. E-mail address: Olav.Kvalheim@kj.uib.no
0169-7439/$ - see front matter. doi:10.1016/j.chemolab.2008.08.004

Several approaches for analyzing the multivariate correlation structure of mass profiles are available. Principal component analysis (PCA) [11], whereby the data matrix of spectral intensities, i.e. samples times mass-to-charge (m/z) profiles, is decomposed into uncorrelated latent variables, represents a common approach to search for patterns in multivariate data. Unfortunately, the possible signature for discriminating patients from controls is usually buried in dominating shared patterns. This is a consequence of the fact that most components in a body fluid are not altered in the early phase of a disease. A better approach is therefore to use methods that attempt to highlight components that separate controls from patients. One such method that has gained wide use is partial least squares discriminant analysis (PLS-DA) [12]. This multivariate regression method uses a