Distribution based truncation for variable selection in subspace methods for multivariate regression
Kristian Hovde Liland a,⁎, Martin Høy b, Harald Martens b,c, Solve Sæbø a

a Norwegian University of Life Sciences, Department of Chemistry, Biotechnology and Food Science, P.O. Box 5003, N-1432 Ås, Norway
b Nofima, Norwegian Institute of Food, Fisheries and Aquaculture Research, Osloveien 1, N-1430 Ås, Norway
c Norwegian University of Life Sciences, Department of Mathematical Sciences and Technology, P.O. Box 5003, N-1432 Ås, Norway
Article history: Received 3 April 2012; Received in revised form 8 January 2013; Accepted 16 January 2013; Available online xxxx

Keywords: Multivariate; Sparse; Truncation; Variable selection; Partial least squares; Relevant components

Abstract
Analysis of data containing a vast number of features, but only a limited number of informative ones, requires
methods that can separate true signal from noise variables. One class of methods attempting this is the sparse
partial least squares methods for regression (sparse PLS). This paper aims at improving the theoretical foun-
dation, speed and robustness of such methods. A general justification of truncation of PLS loading weights is
achieved through distribution theory and the central limit theorem. We also introduce a quick plug-in based
truncation procedure based on a novel application of theory intended for analysis of variance for experiments
without replicates. The result is a versatile and intuitive method that performs component-wise variable se-
lection very efficiently and in a less ad hoc manner than existing methods. Prediction performance is on par
with existing methods, while robustness is ensured through a better theoretical foundation.
© 2013 Elsevier B.V. All rights reserved.
1. Introduction
One of the major challenges in modern data analysis is the ever increasing number of variables recorded for each sample: data matrices become wider and wider. Because of instrumental noise, biological noise and other uncontrollable variations in the recorded signal, variables that should carry no signal for a given sample, or that should be identical across samples, almost never equal zero in the final centered data set; likewise, differences between two signals that should be zero are seldom exactly zero in practice. Since predictive multivariate methods like partial least squares regression (PLSR) [1] in their basic forms take all variables into account, the sheer number of non-zero noise variables will often overshadow the true signal.
Various variable selection approaches have been proposed in the context of regression, with the purpose of stabilizing the regression modeling and improving its predictive ability and interpretability. Variable selection can also help identify important variables in explorative studies. Sometimes the aim is to find which variables influence a certain process causally, or at least convey the most interesting information, e.g. metabolites, genes, wavenumbers or molecular weights. Depending on the aim of the study, different selection strategies may be favorable, and the emphasis on how many variables to retain may differ.
Based on the ideas of component-wise variable selection, sparseness and normally distributed noise, we propose to use distribution based truncation to identify all unimportant model parameters that are (or appear to be) non-zero due to random errors, and to force these towards zero. In the present PLSR context, this means zeroing out small, apparently random elements in all the loading weight vectors. The intention is thereby to drastically reduce the problem of non-zero noise contributions. In the following sections we review some related methods intended for the same purpose and motivate a simple, intuitive and flexible strategy for truncation of non-informative variables. Applications to real and simulated data, and comparisons with other methods, will also be presented.
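The truncation idea described above can be illustrated with a minimal sketch. Note that this is not the authors' plug-in procedure (which is developed later in the paper); the median/MAD-based cut-off used below is a hypothetical stand-in for a distribution-based threshold, chosen only to show how elements of a loading-weight vector that are consistent with zero-mean Gaussian noise can be forced to zero:

```python
# Illustrative sketch only: zero out the small, apparently random elements
# of a simulated loading-weight vector. The robust median/MAD threshold is
# a hypothetical stand-in for a distribution-based cut-off.
import random
import statistics

random.seed(1)

# Simulated loading weights: 3 informative variables with large weights,
# plus 97 variables of pure zero-mean Gaussian noise.
informative = [0.9, -0.8, 0.7]
noise = [random.gauss(0.0, 0.05) for _ in range(97)]
w = informative + noise

def truncate_weights(w, k=3.0):
    """Set to zero every element lying within k robust standard
    deviations of the median, i.e. elements consistent with noise."""
    med = statistics.median(w)
    mad = statistics.median(abs(x - med) for x in w)
    scale = 1.4826 * mad  # scales MAD to a std. dev. under Gaussian noise
    return [x if abs(x - med) > k * scale else 0.0 for x in w]

w_trunc = truncate_weights(w)
kept = sum(1 for x in w_trunc if x != 0.0)
print(kept)  # the 3 large weights survive; nearly all noise weights are zeroed
```

The essential point is only the mechanism: elements whose magnitude is explainable by the estimated noise distribution are truncated to exactly zero, giving component-wise sparsity.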
2. Background
A cornerstone result in statistics is the central limit theorem (CLT). The CLT was first presented by Abraham de Moivre in 1733 and has been formalized and interpreted under varying conditions and degrees of strictness ever since. A simple interpretation is that as the number of observations sampled from a random process increases, the distribution of their mean (and sum) approaches a normal distribution. More interesting in this context is that many types of random noise are approximately normally distributed, and linear combinations of such noise tend even more towards the normal distribution. In this paper we propose to use the CLT to distinguish variables with expected non-zero loading weights from noise variables whose loading weights have zero expectation. We refer to the new modeling principle as Truncation-PLS in the following, and the resulting methods
Chemometrics and Intelligent Laboratory Systems 122 (2013) 103–111. http://dx.doi.org/10.1016/j.chemolab.2013.01.008
⁎ Corresponding author. Tel.: +47 64965830. E-mail address: kristian.liland@umb.no (K.H. Liland).