Significance of the structure of data in partial
least squares regression predictions involving
both natural and human experimental design
Åsmund Rinnan* and Lars Munck
When predicting the chemical composition of food samples from near-infrared spectroscopy using partial least
squares regression, detailed knowledge of the origin of the predictive information is usually lacking. We aim to open the
Pandora’s box of how the prediction of protein proceeds in a unique set of chemically diverse barley mutant
samples. An external validation of the sources of co-variation in nature that are exploited by chemometric models
would give a framework for manipulating the deciding information to make expensive calibration more economical.
The barley samples were supplemented by two designed data sets: one mirroring the coarse composition of the
barley samples by mixing six main chemical components and one set where the biological covariance between
the six chemical components had been reduced.
The three original data sets give remarkably comparable prediction models, although their regression coefficients are
quite different. The origin of the prediction ability of the data is elucidated by splitting the natural barley samples
into two parts: one based on simulated biology extracted from a set of chemical mixtures, and the residual after the
chemistry has been removed from the raw data. As much as 98.1% of the spectral information in the natural barley
data is explained through the simulated biology, leaving as little as 1.9% of the spectral information for the
unexplained biological variation and noise. However, unexplained biological variation still gives a fair prediction
of protein (RMSECV = 1.23 and r² = 0.80, compared with RMSECV = 0.46 and r² = 0.97 for the natural data), and it
gives a clear principal component analysis separation of the three genotype classes. The results were interpreted
by spectral inspection of the origin of the unique covariate patterns appearing in self-organised
biological systems. These findings should motivate researchers and industry to investigate the compressive effect that the
model has on the essential deterministic biological data. Copyright © 2012 John Wiley & Sons, Ltd.
Keywords: NIRS; PLSR; external validation; barley; barley mutant model
1. INTRODUCTION
Validation of multivariate prediction models in chemometric
literature is to a large extent focused on internal methods such
as cross-validation. However, when analysing biological materials,
the chemometric data analysis exploits the covariance of
chemical and physical data components genetically and
environmentally expressed in nature. It is of interest to get a
more thorough knowledge of the system being modelled to
improve the understanding of the variation important for the
regression model. Partial least squares regression [1] (PLSR), in
combination with near-infrared spectroscopy (NIRS), is widely
used to predict chemical components, for example in tablet
manufacturing in the pharmaceutical industry, and for seed
grading in the food (malt) industry [2]. However, the complexity
of the samples differs greatly between such applications, and the need for external
validation to understand why and how the PLS prediction model
works is fundamentally different.
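The internal validation referred to above can be illustrated with a minimal sketch of PLS1 regression with k-fold cross-validation, reporting RMSECV and r². It is not the authors' implementation: the function names `pls1_fit` and `rmsecv`, the five-fold split, the three-component model and the synthetic "spectra" (random mixtures of three hypothetical pure-component spectra) are all assumptions made for illustration only.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model (NIPALS-style deflation).

    Returns the regression vector b and the training means, so that
    y_hat = (X_new - x_mean) @ b + y_mean.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                 # weight: direction of maximal covariance with y
        w = w / np.linalg.norm(w)
        t = Xr @ w                    # scores
        tt = t @ t
        p = Xr.T @ t / tt             # X loadings
        c = yr @ t / tt               # y loading
        Xr = Xr - np.outer(t, p)      # deflate X
        yr = yr - c * t               # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)
    return b, x_mean, y_mean

def rmsecv(X, y, n_components, n_folds=5, seed=0):
    """Root mean square error of cross-validation and squared correlation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_hat = np.empty(len(y), dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        b, x_mean, y_mean = pls1_fit(X[train_idx], y[train_idx], n_components)
        y_hat[test_idx] = (X[test_idx] - x_mean) @ b + y_mean
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    r2 = np.corrcoef(y, y_hat)[0, 1] ** 2
    return rmse, r2

# Synthetic stand-in data (assumption): 92 samples, 50 "wavelengths",
# spectra formed as mixtures of three hypothetical pure-component spectra.
rng = np.random.default_rng(1)
C = rng.uniform(0.0, 1.0, size=(92, 3))          # component fractions
S = rng.normal(size=(3, 50))                      # pure-component spectra
X = C @ S + 0.01 * rng.normal(size=(92, 50))      # mixture spectra plus noise
y = 20.0 * C[:, 0]                                # response tied to component 1

rmse, r2 = rmsecv(X, y, n_components=3)
print(f"RMSECV = {rmse:.3f}, r2 = {r2:.3f}")
```

Such figures describe only how well the model reproduces the reference values within the calibration set; they do not by themselves reveal which sources of covariation the model exploits, which is the motivation for external validation.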
The purpose of this paper is to compare the fundamentals of
PLSR/NIRS calibration and prediction of protein in a set of diverse
natural barley samples (n = 92). A sample set of biological
samples ‘designed’ by nature was selected from a mutant barley
material [3] developed since 1965 [4], consisting of three
extreme classes of barley seed genotypes including normal
barley (N) and specific regulative protein (P) and structural
carbohydrate (C) mutants grown in extreme environmental
combinations (greenhouse, field locations, years). As can be
seen from Table 1, there is a large difference in the correlations
of the major components in the three classes of N, P
and C barley.
A subset of one third of the 92 samples was simulated by
mixing the six major chemical components of cereals
(barley) in purified form. In addition, an experiment of chemical
mixtures was designed that mirrors the ranges of barley
seed composition in the first set but where the biological
co-variation between the five components, starch, protein,
β-glucan (cell wall substances), water and fat, has been greatly
reduced.
Wavelengths of importance for the models are defined
using PLSR diagnostic methods and compared with those used
in classical spectroscopy, that is, correlation spectra and band
assignments from the literature. The possibilities and limits of external
validation to acquire robust PLSR prediction models using NIRS
are finally discussed.
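A correlation spectrum, as mentioned above, can be sketched as the Pearson correlation between the absorbance at each wavelength and the reference value. The following is a minimal illustration under stated assumptions: the function name `correlation_spectrum`, the synthetic data and the choice of column 12 as the informative band are hypothetical and do not come from the barley data. The sketch also assumes that no wavelength has constant absorbance, since the correlation would then be undefined.

```python
import numpy as np

def correlation_spectrum(X, y):
    """Pearson correlation between each column of X (wavelength) and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return (Xc.T @ yc) / denom

# Synthetic stand-in data (assumption): 92 samples, 30 "wavelengths",
# of which only column 12 carries information about the response.
rng = np.random.default_rng(2)
y = rng.uniform(8.0, 20.0, size=92)          # protein-like reference values
X = 0.1 * rng.normal(size=(92, 30))          # uninformative background
X[:, 12] += 0.1 * y                          # one band that tracks y

r = correlation_spectrum(X, y)
print("most correlated wavelength index:", int(np.argmax(np.abs(r))))
```

Because each wavelength is treated separately, a correlation spectrum does not account for overlapping bands, which is one reason to compare it with the multivariate PLSR diagnostics.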
* Correspondence to: Åsmund Rinnan, Department of Food Science, Faculty of
Life Sciences, University of Copenhagen, Frederiksberg, Denmark.
E-mail: aar@life.ku.dk
Å. Rinnan, L. Munck
Department of Food Science, Faculty of Life Sciences, University of Copenhagen,
Frederiksberg, Denmark
Research Article
Received: 05 September 2011, Revised: 13 February 2012, Accepted: 13 February 2012, Published online in Wiley Online Library: 2012
(wileyonlinelibrary.com) DOI: 10.1002/cem.2438
J. Chemometrics (2012)