Uncorrelated linear discriminant analysis (ULDA): A powerful tool for exploration of metabolomics data

Dalin Yuan a, Yizeng Liang a,⁎, Lunzhao Yi a, Qingsong Xu b, Olav M. Kvalheim c

a Research Center of Modernization of Chinese Medicines, College of Chemistry and Chemical Engineering, Central South University, Changsha 410083, PR China
b School of Mathematic Sciences, Central South University, Changsha 410083, PR China
c Department of Chemistry, University of Bergen, Allégaten 41, N-5007 Bergen, Norway

ARTICLE INFO
Article history: Received 26 December 2007; Received in revised form 13 April 2008; Accepted 14 April 2008; Available online 18 April 2008
Keywords: Discriminant analysis; Biomarker screening; Metabolomics; Feature extraction; Uncorrelated linear discriminant analysis (ULDA)

ABSTRACT
The theory of uncorrelated linear discriminant analysis (ULDA), together with an algorithm, is introduced and applied to explore metabolomics data. ULDA is a supervised method for feature extraction (FE), discriminant analysis (DA) and biomarker screening based on the Fisher criterion function. While principal component analysis (PCA) searches for directions of maximum variance in the data, ULDA seeks linear combinations of the variables called uncorrelated discriminant vectors (UDVs). The UDVs maximize the separation among different classes in terms of the Fisher criterion. The performance of ULDA is evaluated and compared with PCA, partial least squares discriminant analysis (PLS-DA) and target projection discriminant analysis (TP-DA) on two datasets, one simulated and one real dataset from a metabolomics study. ULDA showed better discriminatory ability than PCA, PLS-DA and TP-DA. The shortcomings of PCA, PLS-DA and TP-DA are attributed to interference from linear correlations in the data. PLS-DA and TP-DA performed successfully for the simulated data, but PLS-DA was slightly inferior to ULDA for the real data.
ULDA successfully extracted optimal features for discriminant analysis and revealed potential biomarkers. Furthermore, in cross-validation, the classification model obtained by ULDA showed better predictive ability than those obtained by PCA, PLS-DA and TP-DA. In conclusion, ULDA is a powerful tool for revealing discriminatory information in metabolomics data.
© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Metabolomics is defined as “a comprehensive analysis of the whole metabolome under a given set of conditions” [1]. Metabolomics has advantages over other -omics approaches in efficiently building knowledge of biological status because intermediary metabolism is proximal to the phenotype, and metabolites can be measured quantitatively and comprehensively [2,3]. Changes in metabolites reflect the effects of internal and external factors on living systems, such as growth age, climate, soil type and moisture content, temperature, stress factors, pathological changes and medication [4]. Metabolomics coupled with multivariate discriminant analysis has shown its potential in, e.g., plant genotype discrimination [5–7], toxicological screening [8,9] and disease diagnosis [10–14].

Generally, the data produced in metabolomics studies are high-dimensional, since they are commonly acquired on NMR, GC-MS and HPLC-MS instruments. With such complex data, methods for discriminant analysis and biomarker screening may perform worse or become time-consuming. To address this problem, a feature extraction (FE) method is often employed. Another advantage of FE is visualization of the data structure: after dimensionality reduction, the data can be displayed in 2-D or 3-D space, in which one gains insight into the data structure. Principal component analysis (PCA) and partial least squares discriminant analysis (PLS-DA) are the best known methods for feature extraction [14–20].
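The contrast between a maximum-variance direction and a direction chosen by the Fisher criterion can be made concrete with a small simulation. The sketch below is not the ULDA algorithm of this paper. It is ordinary two-class Fisher discriminant analysis on hypothetical data, assuming the usual form of the criterion, J(w) = (w' S_b w) / (w' S_w w), with S_b and S_w the between-class and within-class scatter matrices. All names (scatter_matrices, fisher_criterion, the simulated variables) are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def scatter_matrices(X, y):
    """Return between-class (Sb) and within-class (Sw) scatter matrices."""
    overall_mean = X.mean(axis=0)
    p = X.shape[1]
    Sb = np.zeros((p, p))
    Sw = np.zeros((p, p))
    for label in np.unique(y):
        Xc = X[y == label]
        class_mean = Xc.mean(axis=0)
        d = (class_mean - overall_mean)[:, None]
        Sb += len(Xc) * (d @ d.T)
        Sw += (Xc - class_mean).T @ (Xc - class_mean)
    return Sb, Sw


def fisher_criterion(w, Sb, Sw):
    """Fisher criterion J(w) = (w' Sb w) / (w' Sw w) for a direction w."""
    return float(w @ Sb @ w) / float(w @ Sw @ w)


# Hypothetical data: variable 0 has large variance but no class difference,
# variable 1 has small variance but carries the class difference.
rng = np.random.default_rng(0)
n = 100
X0 = np.column_stack([rng.normal(0, 10, n), rng.normal(0, 1, n)])
X1 = np.column_stack([rng.normal(0, 10, n), rng.normal(3, 1, n)])
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

Sb, Sw = scatter_matrices(X, y)

# PCA direction: leading eigenvector of the covariance matrix
# (eigh returns eigenvalues in ascending order, so take the last column).
_, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
w_pca = eigvecs[:, -1]

# Two-class Fisher direction: proportional to Sw^{-1} (m1 - m0).
w_fisher = np.linalg.solve(Sw, X1.mean(axis=0) - X0.mean(axis=0))
w_fisher /= np.linalg.norm(w_fisher)

J_pca = fisher_criterion(w_pca, Sb, Sw)
J_fisher = fisher_criterion(w_fisher, Sb, Sw)
print("PCA direction:   ", w_pca, "J =", J_pca)
print("Fisher direction:", w_fisher, "J =", J_fisher)
```

In this toy setting the first principal component follows the high-variance, non-discriminatory variable, whereas the Fisher direction follows the low-variance variable that separates the classes. With only two classes there is a single discriminant vector, so the uncorrelatedness constraint that distinguishes ULDA does not come into play here.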
PCA transforms the original variables of the data X (spectral observations or integrated peak areas in the measured spectrum) into a small set of orthogonal principal components that accounts for most of the variance [21]. However, the principal components are not guaranteed to carry any class-discriminatory information [22]. In addition, because PCA is unsupervised, it can produce an undesirable clustering of the data that emphasizes irrelevant random changes in the abundant metabolites instead of the changes associated with the functionally relevant low-concentration metabolites [23]. PLS-DA attempts to derive latent variables that maximize the covariance between the measured data X and the response variable(s) y (Y), i.e. the ‘dummy’ indicator vector (a matrix for more than two groups) [24,25]. As a supervised method, PLS-DA makes use of the class membership of the samples in the feature extraction. However, variables with large variance or high covariance can affect the PLS-DA procedure even though these variables contain little or no information contributing to the discrimination of samples. This may result in the loss of optimal features in some complicated situations. In order to improve the interpretative aspect of PLS, Kvalheim and Karstang developed an approach called target projection (TP) [26]. A

Chemometrics and Intelligent Laboratory Systems 93 (2008) 70–79 (journal homepage: www.elsevier.com/locate/chemolab)
⁎ Corresponding author. E-mail address: yizeng_liang@263.net (Y. Liang).
0169-7439/$ – see front matter © 2008 Elsevier B.V. All rights reserved.
doi:10.1016/j.chemolab.2008.04.005