Biomarker discovery in mass spectral profiles by means of selectivity ratio plot
Tarja Rajalahti a,b, Reidar Arneberg c, Frode S. Berven d, Kjell-Morten Myhr a,b,e, Rune J. Ulvik d,f, Olav M. Kvalheim g,⁎
a Department of Clinical Medicine, University of Bergen, Bergen, Norway
b Department of Neurology, Haukeland University Hospital, Bergen, Norway
c Pattern Recognition Systems AS, Bergen, Norway
d Institute of Medicine, University of Bergen, Bergen, Norway
e The Norwegian Multiple Sclerosis National Competence Centre, Haukeland University Hospital, Bergen, Norway
f Laboratory of Clinical Biochemistry, Haukeland University Hospital, Bergen, Norway
g Department of Chemistry, University of Bergen, N-5007 Bergen, Norway
Article info
Article history: Received 7 April 2008; received in revised form 15 July 2008; accepted 6 August 2008; available online 22 August 2008
Keywords: Biomarkers; Variable selection; Target projection; Discriminant analysis; Cerebrospinal fluid
Abstract
This work presents a new method for variable selection in complex spectral profiles. The method is validated
by comparing samples from cerebrospinal fluid (CSF) with the same samples spiked with peptide and protein
standards at different concentration levels. Partial least squares discriminant analysis (PLS-DA) attempts to
separate two groups of samples by regressing on a y-vector consisting of zeros and ones in the PLS
decomposition. In most cases, several PLS components are needed to optimize the discrimination between
groups. This creates difficulties for the interpretation of the model. By using the y-vector as a target, it is
possible to transform the PLS components to obtain a single predictive target-projected component
analogous to the predictive component in orthogonal partial least squares discriminant analysis (OPLS-DA).
By calculating the ratio between explained and residual variance of the spectral variables on the
target-projected component, a selectivity ratio plot is obtained that can be used for variable selection. Applied to
whole mass spectral profiles of pure and spiked CSF, the method detects peptides in the low molecular mass range
(740–9000 Da) down to at least the 400 pM level without severe problems with false biomarker candidates.
Similarly, added proteins are detected down to at least the 2 nM level in the medium mass range (6000–17,500 Da).
Target projection represents the optimal way to fit a latent variable decomposition to a known target, but the
selectivity ratio plot can be used for OPLS as well as other methods that produce a single predictive
component. Comparison with some commonly used tools for variable selection shows that the selectivity
ratio plot has the best performance. This observation is attributed to the fact that target projection utilizes
both the predictive ability (regression coefficients) and the explanatory ability (spectral variance/covariance
matrix) for the calculation of the selectivity ratio.
© 2008 Elsevier B.V. All rights reserved.
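The calculation summarized in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the authors' implementation. It assumes a common formulation of target projection that the abstract does not spell out: the target-projection weights are taken as the normalized PLS regression vector, the scores are the projection of the centered data onto those weights, and the loadings are obtained by regressing the data on the scores. The function names (`pls1_coefficients`, `selectivity_ratio`, `make_demo_data`) and the demo parameters are hypothetical.

```python
import numpy as np


def pls1_coefficients(Xc, yc, n_components):
    """Regression vector of a PLS1 model (NIPALS) for centered Xc and yc."""
    Xk, yk = Xc.copy(), yc.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                 # weights: covariance of each variable with y
        w /= np.linalg.norm(w)
        t = Xk @ w                    # scores
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        c = (yk @ t) / tt             # y loading
        Xk = Xk - np.outer(t, p)      # deflate X
        yk = yk - c * t               # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)


def selectivity_ratio(X, y, n_components=2):
    """Explained / residual variance of each variable on one target-projected component.

    Assumed formulation: weights = normalized PLS regression vector.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    b = pls1_coefficients(Xc, yc, n_components)
    w_tp = b / np.linalg.norm(b)            # target-projection weights
    t_tp = Xc @ w_tp                        # target-projected scores
    p_tp = Xc.T @ t_tp / (t_tp @ t_tp)      # target-projected loadings
    X_tp = np.outer(t_tp, p_tp)             # part of X explained by the component
    E_tp = Xc - X_tp                        # residual part of X
    return (X_tp ** 2).sum(axis=0) / (E_tp ** 2).sum(axis=0)


def make_demo_data(seed=0, n_per_group=20, n_vars=30, marker=7, shift=5.0):
    """Synthetic noise profiles with one 'spiked' variable in the second group."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(2 * n_per_group, n_vars))
    y = np.repeat([0.0, 1.0], n_per_group)  # y-vector of zeros and ones
    X[y == 1, marker] += shift
    return X, y


X, y = make_demo_data()
sr = selectivity_ratio(X, y, n_components=2)
print("variable with the largest selectivity ratio:", int(np.argmax(sr)))
```

Plotting `sr` against the variable index (here, m/z channel) gives a selectivity ratio plot in the sense described above. Variables whose variance is mostly explained by the target-projected component stand out, while variables dominated by residual variance stay near zero.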
1. Introduction
Mass spectral characterization of body fluids such as urine [1],
blood [2–6], cerebrospinal fluid [7] and sweat [8] followed by
multivariate analysis represents a useful approach to reveal biomarkers
in metabolomic and proteomic research. With the development
of matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF)
[9] and other methods for mass spectrometry, fractions with
hundreds and even thousands of proteins have become accessible to
chemical profiling at the molecular level. A common objective for such
profiling is to reveal differences in composition between fractions of
body fluids from healthy controls and persons with a specific disease
[3,5,10]. Components that differ in absolute or relative amounts
between the controls and patients may have diagnostic value and at
the same time provide clues about the pathogenesis of a disease. The
ratio of the number of spectral variables to the number of available samples
may exceed 1000 in such studies. This makes it difficult to avoid a
“forest” of false biomarker candidates.
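The scale of this problem can be illustrated with a small simulation. The sketch below is an illustration under stated assumptions, not part of the original study. It generates pure noise with 20 samples and 20,000 variables (a variable-to-sample ratio of 1000) and counts how many variables appear to separate two arbitrary groups by chance. A univariate Welch t-statistic is used as the screening criterion purely for simplicity; it is not the method discussed in this paper. The function names and the threshold of 3 are hypothetical choices.

```python
import numpy as np


def welch_t(X, groups):
    """Welch t-statistic for every column of X between group 1 and group 0."""
    a, b = X[groups == 0], X[groups == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (b.mean(axis=0) - a.mean(axis=0)) / se


def count_chance_hits(n_per_group=10, n_vars=20000, threshold=3.0, seed=0):
    """Count noise variables whose |t| exceeds the threshold by chance alone."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(2 * n_per_group, n_vars))  # pure noise: no true biomarker
    groups = np.repeat([0, 1], n_per_group)
    return int(np.sum(np.abs(welch_t(X, groups)) > threshold))


hits = count_chance_hits()
print(hits, "of 20000 noise variables exceed |t| > 3")
```

Every variable flagged here is a false candidate by construction, since the data contain no group difference at all.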
Several approaches for analyzing the multivariate correlation
structure of mass profiles are available. Principal component analysis
(PCA) [11], whereby the data matrix of spectral intensities, i.e. samples
× mass-to-charge (m/z) profiles, is decomposed into uncorrelated
latent variables, represents a common approach to search for patterns
in multivariate data. Unfortunately, the possible signature for
discriminating patients from controls is usually buried in dominating
shared patterns. This is a consequence of the fact that most
components in a body fluid are not altered in the early phase of a
disease. A better approach is therefore to use methods that attempt to
highlight components that separate controls from patients. One such
method that has gained wide use is partial least squares discriminant
analysis (PLS-DA) [12]. This multivariate regression method uses a
Chemometrics and Intelligent Laboratory Systems 95 (2009) 35–48
⁎ Corresponding author. Tel.: +47 55583366; fax: +47 55589490.
E-mail address: Olav.Kvalheim@kj.uib.no (O.M. Kvalheim).
0169-7439/$ – see front matter © 2008 Elsevier B.V. All rights reserved.
doi:10.1016/j.chemolab.2008.08.004