Significance of the structure of data in partial least squares regression predictions involving both natural and human experimental design

Åsmund Rinnan* and Lars Munck

When predicting the chemical composition of food samples from near-infrared spectroscopy using partial least squares regression, deep knowledge of the origin of the information is lacking. We aim to open the Pandora's box of how the prediction of protein proceeds in a unique set of chemically diverse barley mutant samples. An external validation of the sources of co-variation in nature that are exploited by chemometric models would give a framework for manipulating the decisive information to make expensive calibration more economical. The barley samples were supplemented by two designed data sets: one mirroring the coarse composition of the barley samples by mixing six main chemical components, and one in which the biological covariance between the chemical components had been reduced. The three original data sets give remarkably comparable prediction models, although their regression coefficients are quite different. The origin of the prediction ability of the data is elucidated by splitting the natural barley samples into two parts: one based on simulated biology extracted from a set of chemical mixtures, and the residual after the chemistry has been removed from the raw data. As much as 98.1% of the spectral information in the natural barley data is explained through the simulated biology, leaving as little as 1.9% of the spectral information for the unexplained biological variation and noise. However, this unexplained biological variation still gives a fair prediction of protein (RMSECV = 1.23 and r² = 0.80, compared with RMSECV = 0.46 and r² = 0.97 for the natural data), and it gives a clear principal component analysis separation of the three genotype classes.
The results were interpreted by spectral inspection of the origin of the unique covariate patterns that appear in self-organised biological systems; these patterns should motivate researchers and industry to investigate the compressive effect that the model has on the essential deterministic biological data. Copyright © 2012 John Wiley & Sons, Ltd.

Keywords: NIRS; PLSR; external validation; barley; barley mutant model

1. INTRODUCTION

Validation of multivariate prediction models in the chemometric literature is to a large extent focused on internal methods such as cross-validation. However, when biological materials are analysed, the chemometric data analysis exploits the covariance of chemical and physical data components that is genetically and environmentally expressed in nature. It is therefore of interest to gain a more thorough knowledge of the system being modelled, to improve the understanding of the variation that is important for the regression model. Partial least squares regression (PLSR) [1], in combination with near-infrared spectroscopy (NIRS), is widely used to predict chemical components, for example in tablet manufacturing in the pharmaceutical industry and in seed grading in the food (malt) industry [2]. However, the complexity of the samples differs greatly between such applications, and so does the need for external validation to understand why and how the PLS prediction model works. The purpose of this paper is to compare the fundamentals of PLSR/NIRS calibration and prediction of protein in a set of diverse natural barley samples (n = 92). A set of biological samples 'designed' by nature was selected from a mutant barley material [3] developed since 1965 [4]. It comprises three extreme classes of barley seed genotypes: normal barley (N), specific regulative protein (P) mutants and structural carbohydrate (C) mutants, grown in extreme environmental combinations (greenhouse, field locations, years).
As can be seen from Table 1, the correlations between the major components differ widely among the three classes of N, P and C barley. A subset of one third of the 92 samples was simulated by mixing the six major chemical components of cereals (barley) in purified form. In addition, an experiment of chemical mixtures was designed that mirrors the ranges of barley seed composition in the first set but in which the biological co-variation between the five components, starch, protein, β-glucan (cell wall substances), water and fat, has been greatly reduced. Wavelengths of importance for the models are defined using PLSR diagnostic methods and compared with those used in classical spectroscopy, that is, correlation spectra and assignments from the literature. Finally, the possibilities and limits of external validation for acquiring robust PLSR prediction models using NIRS are discussed.

* Correspondence to: Åsmund Rinnan, Department of Food Science, Faculty of Life Sciences, University of Copenhagen, Frederiksberg, Denmark. E-mail: aar@life.ku.dk

Å. Rinnan, L. Munck
Department of Food Science, Faculty of Life Sciences, University of Copenhagen, Frederiksberg, Denmark

Research Article
Received: 05 September 2011, Revised: 13 February 2012, Accepted: 13 February 2012, Published online in Wiley Online Library: 2012 (wileyonlinelibrary.com) DOI: 10.1002/cem.2438
J. Chemometrics (2012)
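The internal validation statistics quoted for the protein models (RMSECV and r²) can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a PLS1 model with a single latent variable, leave-one-out cross-validation, r² taken as the squared Pearson correlation between measured and cross-validated values, and purely synthetic "spectra". All function names and the synthetic loadings are hypothetical.

```python
import math
import random


def center(rows):
    """Return the column means and a mean-centred copy of a list of rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    return means, centred


def fit_pls1_one_component(X, y):
    """Fit PLS1 with one latent variable and return a predict function."""
    x_mean, Xc = center(X)
    y_mean = sum(y) / len(y)
    yc = [v - y_mean for v in y]
    p = len(x_mean)
    # Weight vector: direction of maximum covariance between X and y.
    w = [sum(Xc[i][j] * yc[i] for i in range(len(yc))) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores, then the inner regression of y on the scores.
    t = [sum(row[j] * w[j] for j in range(p)) for row in Xc]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    def predict(x):
        score = sum((x[j] - x_mean[j]) * w[j] for j in range(p))
        return y_mean + q * score

    return predict


def loo_cv_predictions(X, y):
    """Predict each sample from a model fitted on all other samples."""
    preds = []
    for i in range(len(y)):
        model = fit_pls1_one_component(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        preds.append(model(X[i]))
    return preds


def rmsecv(y, y_cv):
    """Root mean squared error of cross-validation."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_cv)) / len(y))


def r_squared(y, y_cv):
    """Squared Pearson correlation between measured and predicted values."""
    n = len(y)
    my, mp = sum(y) / n, sum(y_cv) / n
    sxy = sum((a - my) * (b - mp) for a, b in zip(y, y_cv))
    sxx = sum((a - my) ** 2 for a in y)
    syy = sum((b - mp) ** 2 for b in y_cv)
    return sxy * sxy / (sxx * syy)


# Synthetic stand-in data (assumption): 30 samples, 5 "wavelengths" whose
# absorbance scales linearly with a protein-like reference value plus noise.
random.seed(0)
loadings = [0.5, 1.0, -0.3, 0.8, 0.2]
y = [random.uniform(8.0, 20.0) for _ in range(30)]
X = [[a * yi + random.gauss(0.0, 0.05) for a in loadings] for yi in y]

y_cv = loo_cv_predictions(X, y)
cv_error = rmsecv(y, y_cv)
cv_r2 = r_squared(y, y_cv)
print(f"RMSECV = {cv_error:.3f}, r2 = {cv_r2:.3f}")
```

A real calibration would use several latent variables chosen by cross-validation and measured NIR spectra; the single component is used here only to keep the sketch short and dependency-free.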