Uncorrelated linear discriminant analysis (ULDA): A powerful tool for
exploration of metabolomics data
Dalin Yuan a, Yizeng Liang a,⁎, Lunzhao Yi a, Qingsong Xu b, Olav M. Kvalheim c
a Research Center of Modernization of Chinese Medicines, College of Chemistry and Chemical Engineering, Central South University, Changsha 410083, PR China
b School of Mathematic Sciences, Central South University, Changsha 410083, PR China
c Department of Chemistry, University of Bergen, Allégaten 41, N-5007 Bergen, Norway
ARTICLE INFO
Article history:
Received 26 December 2007
Received in revised form 13 April 2008
Accepted 14 April 2008
Available online 18 April 2008
Keywords:
Discriminant analysis
Biomarker screening
Metabolomics
Feature extraction
Uncorrelated linear discriminant analysis (ULDA)
ABSTRACT
The theory together with an algorithm for uncorrelated linear discriminant analysis (ULDA) is introduced and
applied to explore metabolomics data. ULDA is a supervised method for feature extraction (FE), discriminant
analysis (DA) and biomarker screening based on the Fisher criterion function. While principal component
analysis (PCA) searches for directions of maximum variance in the data, ULDA seeks linearly combined
variables called uncorrelated discriminant vectors (UDVs). The UDVs maximize the separation among
different classes in terms of the Fisher criterion. The performance of ULDA is evaluated and compared with
PCA, partial least squares discriminant analysis (PLS-DA) and target projection discriminant analysis (TP-DA)
for two datasets, one simulated and one real from a metabolomic study. ULDA showed better discriminatory
ability than PCA, PLS-DA and TP-DA. The shortcomings of PCA, PLS-DA and TP-DA are attributed to
interference from linear correlations in data. PLS-DA and TP-DA performed successfully for the simulated
data, but PLS-DA was slightly inferior to ULDA for the real data. ULDA successfully extracted optimal features
for discriminant analysis and revealed potential biomarkers. Furthermore, by means of cross-validation, the
classification model obtained by ULDA showed better predictive ability than PCA, PLS-DA and TP-DA. In
conclusion, ULDA is a powerful tool for revealing discriminatory information in metabolomics data.
© 2008 Elsevier B.V. All rights reserved.
1. Introduction
Metabolomics is defined as “a comprehensive analysis of the whole
metabolome under a given set of conditions” [1]. Metabolomics has
advantages over other -omics approaches in efficiently building
knowledge of biological status because the intermediary metabolism
is proximal to phenotype, and metabolites can be measured
quantitatively and comprehensively [2,3]. Changes in metabolite levels
reflect the effects of internal and external factors on living systems,
such as age, climate, soil type and moisture content,
temperature, stress factors, pathological changes and medication [4].
Metabolomics coupled with multivariate discriminant analysis has
shown its potential in, e.g., plant genotype discrimination [5–7],
toxicological screening [8,9] and disease diagnosis [10–14].
Generally, the data produced in metabolomics studies are
high-dimensional since they are commonly acquired on NMR, GC-MS and
HPLC-MS instruments. Methods for discriminant analysis and biomarker
screening may perform worse, or become time-consuming, as the
complexity of the data increases. To solve this problem, feature
extraction (FE) methods are often employed. Another advantage of FE is
visualization of the data structure: after dimensionality reduction,
the data can be displayed in 2-D or 3-D space, in which one gains
insight into the data structure. Principal component analysis (PCA)
and partial least squares discriminant analysis (PLS-DA) are the
best-known methods for feature extraction [14–20]. PCA transforms the
original variables of the data X (spectral observations or integrated
areas of peaks in the measured spectrum) into a small orthogonal set of
principal components that accounts for most of the variance [21].
However, the principal components are not guaranteed to carry any
class-discriminatory information [22]. In addition, because PCA is
unsupervised, it may produce an undesirable clustering of the data that
emphasizes irrelevant random changes in the abundant metabolites
instead of the changes associated with the functionally relevant
low-concentration metabolites [23]. PLS-DA attempts to derive latent
variables that maximize the covariance between the measured data X and
the response variable(s) y (Y), the ‘dummy’ indicator vector (a matrix
for more than two groups) [24,25]. As a supervised method, PLS-DA makes
use of the class membership of the samples in the feature extraction.
However, variables with large variance or high covariance can affect
the PLS-DA procedure, even though these variables contain little or no
information contributing to the discrimination of samples. This
may result in loss of optimal features in some complicated situations.
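The contrast drawn above between maximum-variance, maximum-covariance and Fisher-criterion directions can be illustrated with a minimal sketch. It is not the ULDA algorithm of this paper: it uses plain two-class Fisher discriminant analysis, the first PLS weight vector (proportional to X^T y after centering) and the first principal component on synthetic data. The data, the helper `fisher_ratio` and all variable names are assumptions made for illustration only.

```python
import numpy as np

def fisher_ratio(w, X, y):
    """Between-class over within-class scatter of the 1-D projection X @ w."""
    z = X @ w
    z0, z1 = z[y == 0], z[y == 1]
    return (z0.mean() - z1.mean()) ** 2 / (z0.var() + z1.var())

rng = np.random.default_rng(0)
n = 200  # samples per class (hypothetical)
y = np.repeat([0, 1], n)
# Hypothetical data: variable 0 has large variance but no class information,
# variable 1 has small variance and carries the class difference.
X = np.column_stack([
    rng.normal(0.0, 10.0, 2 * n),
    rng.normal(0.0, 1.0, 2 * n) + 3.0 * y,
])
Xc = X - X.mean(axis=0)

# PCA: the first right singular vector is the direction of maximum variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
w_pca = Vt[0]

# First PLS weight vector: direction of maximum covariance with the dummy y.
w_pls = Xc.T @ (y - y.mean())
w_pls /= np.linalg.norm(w_pls)

# Two-class Fisher direction: Sw^{-1} (m1 - m0), with Sw the within-class scatter.
m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
Sw = np.cov(X[y == 0], rowvar=False) + np.cov(X[y == 1], rowvar=False)
w_fda = np.linalg.solve(Sw, m1 - m0)
w_fda /= np.linalg.norm(w_fda)

for name, w in [("PCA", w_pca), ("PLS", w_pls), ("Fisher", w_fda)]:
    print(f"{name:6s} direction {np.round(w, 3)}  Fisher ratio {fisher_ratio(w, X, y):.3f}")
```

In this setup the first principal component aligns with the noisy high-variance variable, while the Fisher direction attains the largest value of the criterion among the three; how far the PLS weight vector is pulled towards the high-variance variable depends on the random draw.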
In order to improve the interpretative aspect of PLS, Kvalheim and
Karstang developed an approach called target projection (TP) [26]. A
Chemometrics and Intelligent Laboratory Systems 93 (2008) 70–79
⁎ Corresponding author.
E-mail address: yizeng_liang@263.net (Y. Liang).
0169-7439/$ – see front matter © 2008 Elsevier B.V. All rights reserved.
doi:10.1016/j.chemolab.2008.04.005