O2-PLS, a two-block (X–Y) latent variable regression (LVR) method with an integral OSC filter†

Johan Trygg¹* and Svante Wold²

¹ Institute for Molecular Bioscience, University of Queensland, Australia
² Research Group for Chemometrics, Institute of Chemistry, Umeå University, Umeå, Sweden

Received 15 August 2002; Accepted 30 October 2002

The O2-PLS method is derived from the basic partial least squares projections to latent structures (PLS) prediction approach. The importance of the covariation matrix (Y^T X) is pointed out in relation to both the prediction model and the structured noise in both X and Y. Structured noise in X (or Y) is defined as the systematic variation of X (or Y) not linearly correlated with Y (or X). Examples in spectroscopy include baseline, drift and scatter effects. If structured noise is present in X, the existing latent variable regression (LVR) methods, e.g. PLS, will have weakened score–loading correspondence beyond the first component. This negatively affects the interpretation of model parameters such as scores and loadings. The O2-PLS method models and predicts both X and Y and has an integral orthogonal signal correction (OSC) filter that separates the structured noise in X and Y from their joint X–Y covariation used in the prediction model. This leads to a minimal number of predictive components with full score–loading correspondence and also an opportunity to interpret the structured noise. In both a real and a simulated example, O2-PLS and PLS gave very similar predictions of Y. However, the interpretation of the prediction models was clearly improved with O2-PLS, because structured noise was present. In the NIR example, O2-PLS revealed a strong water peak and baseline offset in the structured noise components. In the simulated example the O2-PLS plot of observed versus predicted Y-scores (u vs û) showed good predictions. The corresponding loading vectors provided good interpretation of the covarying analytes in X and Y.
Copyright © 2003 John Wiley & Sons, Ltd.

KEYWORDS: O2-PLS; O-PLS; latent variable regression; structured noise; score–loading correspondence; model interpretation

1. INTRODUCTION

Spectroscopic (e.g. NMR, NIR) and chromatographic (e.g. GC, LC) techniques are frequently being used for the characterization of solid, semi-solid, fluid and vapor samples. Multivariate calibration methods [1] (e.g. partial least squares projections to latent structures (PLS)) are often used to develop a quantitative relation between the digitized spectra, the matrix X, and some properties (e.g. concentrations) of the analytes, the matrix Y. These methods may also be used to infer other more multivariate properties of samples, e.g. predicting NMR profiles from NIR spectra. This large quantity of information-rich data requires proper multivariate tools.

Often structured noise is present in X (or Y), where structured noise is defined as systematic variation of X (or Y), which is not linearly correlated with Y (or X). Examples in spectroscopy are baseline and scatter effects, as well as spectra of impurities or unknown constituents. It has earlier been shown [2,3] that these negatively affect the interpretation of the existing latent variable regression (LVR) methods, e.g. PLS and other methods with similar properties, and increase prediction model complexity. Preprocessing methods, including derivatives, scatter corrections [4] and orthogonal signal correction (OSC) filters [2,5–10], can be applied to suppress this structured noise. This preprocessing is then followed by two-block (regression/classification) modeling. However, some of these methods may remove pertinent variation from X. Some others fail to remove only the structured noise that disturbs the prediction model, and this leads to increased model complexity and can also result in worse predictions.

Interpretation of LVR model parameters such as scores and loadings is linked to their score–loading correspondence.
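The definition of structured noise given above can be made concrete with a small numerical sketch. The example below is not taken from the paper: the data, the variable names (`t_o`, `p_signal`, `p_offset`, `OFFSET_SCALE`) and the single-y setting are illustrative assumptions. It builds a toy X containing an analyte peak that follows y plus a baseline offset whose sample-wise variation is orthogonal to y. It then computes the covariation vector X^T y (to which the first PLS weight vector is proportional in the single-y case), the scores t = Xw and the loading p = X^T t / (t^T t).

```python
# Toy illustration (all numbers hypothetical): 6 samples, 5 spectral variables.

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

# Centred property y (e.g. an analyte concentration).
y = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]

# Baseline score vector: centred and orthogonal to y,
# so the variation it causes in X is structured noise by the definition above.
t_o = [1.0, -2.0, 1.0, 1.0, -2.0, 1.0]

p_signal = [0.0, 1.0, 2.0, 1.0, 0.0]   # analyte peak profile (assumed)
p_offset = [1.0, 1.0, 1.0, 1.0, 1.0]   # flat baseline offset profile (assumed)
OFFSET_SCALE = 3.0                     # makes the baseline dominate the variance of X

# X = y * p_signal^T + OFFSET_SCALE * t_o * p_offset^T
X = [[yi * ps + OFFSET_SCALE * ti * po for ps, po in zip(p_signal, p_offset)]
     for yi, ti in zip(y, t_o)]
columns = list(zip(*X))

# Covariation vector X^T y (unnormalised first PLS weight for a single y).
w = [dot(col, y) for col in columns]

# Scores t = X w and loading p = X^T t / (t^T t).
t = [dot(row, w) for row in X]
tt = dot(t, t)
p = [dot(col, t) / tt for col in columns]

print("y . t_o =", dot(y, t_o))   # zero: baseline is uncorrelated with y
print("w       =", w)             # follows the analyte peak only
print("t . t_o =", dot(t, t_o))   # non-zero: scores pick up the baseline
print("p       =", p)             # non-zero where w is zero
```

In this toy case the baseline does not appear in X^T y, yet it leaks into the scores because the weight vector is not orthogonal to the baseline profile. The loading p then carries an offset where w is zero, which is one way to picture a weakened score–loading correspondence.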
*Correspondence to: J. Trygg, Smythe Group/Gehrmann Building Floor 7, Institute for Molecular Bioscience, University of Queensland, Brisbane QLD 4072, Australia. E-mail: j.trygg@imb.uq.edu.au
† Dedicated to Professor John F. MacGregor: a pioneer of multivariate statistical process control and recipient of the fourth Herman Wold medal.
Contract/grant sponsor: Knut and Alice Wallenberg Foundation.
Contract/grant sponsor: Swedish Natural Science Research Council (NFR).
JOURNAL OF CHEMOMETRICS, J. Chemometrics 2003; 17: 53–64
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cem.775

Strong structured noise in X diminishes the score–