Distribution based truncation for variable selection in subspace methods for multivariate regression

Kristian Hovde Liland a,*, Martin Høy b, Harald Martens b,c, Solve Sæbø a

a Norwegian University of Life Sciences, Department of Chemistry, Biotechnology and Food Science, P.O. Box 5003, N-1432 Ås, Norway
b Nofima, Norwegian Institute of Food, Fisheries and Aquaculture Research, Osloveien 1, N-1430 Ås, Norway
c Norwegian University of Life Sciences, Department of Mathematical Sciences and Technology, P.O. Box 5003, N-1432 Ås, Norway

Article history: Received 3 April 2012; Received in revised form 8 January 2013; Accepted 16 January 2013; Available online xxxx

Keywords: Multivariate; Sparse; Truncation; Variable selection; Partial least squares; Relevant components

Abstract: Analysis of data containing a vast number of features, but only a limited number of informative ones, requires methods that can separate the true signal from noise variables. One class of methods attempting this is the sparse partial least squares (sparse PLS) methods for regression. This paper aims at improving the theoretical foundation, speed and robustness of such methods. A general justification of truncation of PLS loading weights is achieved through distribution theory and the central limit theorem. We also introduce a quick plug-in based truncation procedure based on a novel application of theory intended for analysis of variance in experiments without replicates. The result is a versatile and intuitive method that performs component-wise variable selection very efficiently and in a less ad hoc manner than existing methods. Prediction performance is on par with existing methods, while robustness is ensured through a better theoretical foundation.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

One of the major challenges in recent and coming data analysis is the ever increasing number of variables recorded for each sample. The data matrices become wider and wider.
Because of instrumental noise, biological noise and other uncontrollable variations in the recorded signal, variables that should carry no signal for a given sample, or are equal across samples, almost never show a zero signal in the final centered data set, and differences between two signals that should be zero are seldom zero in practice. Since predictive multivariate methods like partial least squares regression (PLSR) [1] in their basic forms take all variables into account, the sheer number of non-zero noise variables will often overshadow the true signal.

Various forms of variable selection have been proposed in the context of regression. Variable selection can also play a role in finding important variables in explorative studies, with the purpose of stabilizing the regression modeling and improving its predictive ability and interpretability. Sometimes the aim is to find which variables influence a certain process causally, or at least convey the most interesting information, e.g. metabolites, genes, wavenumbers, or molecular weights. Depending on the aim of the study, different selection strategies may be favorable, and the focus on how many variables to retain may differ.

Based on ideas of component-wise variable selection, sparseness and normally distributed noise, we propose to use distribution based truncation to identify all unimportant model parameters that are (or appear to be) non-zero due to random errors, and force these towards zero. In the present PLSR context, this means zeroing out small, apparently random elements in all the loading weight vectors. The intention is thereby to drastically reduce the problem of non-zero noise contributions. In the following sections we look at some related methods intended for the same purpose and motivate a simple, intuitive and flexible strategy for truncation of non-informative variables. Applications to real and simulated data, and comparisons with other methods, will also be presented.
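The truncation idea above can be illustrated with a minimal sketch: elements of a loading-weight vector that fall inside a normal-theory noise band are forced to zero, while larger elements are kept. Here the noise spread is estimated robustly from the median absolute deviation (MAD) and a fixed 95% normal cutoff is used; the function name, the MAD-based scale estimate and the cutoff are illustrative assumptions, not the paper's exact plug-in procedure.

```python
import numpy as np

def truncate_loading_weights(w, z=1.96):
    """Force apparently random elements of a loading-weight vector to zero.

    Illustrative sketch: the spread of the noise elements is estimated
    robustly via the median absolute deviation (MAD), and elements inside
    the two-sided normal band median +/- z * sigma are set to zero.
    """
    w = np.asarray(w, dtype=float)
    med = np.median(w)
    sigma = 1.4826 * np.median(np.abs(w - med))  # MAD scaled to match a normal sd
    return np.where(np.abs(w - med) <= z * sigma, 0.0, w)

# Two clearly informative variables among many near-zero noise elements:
w = np.concatenate(([5.0, -4.0], np.linspace(-0.01, 0.01, 21)))
w_trunc = truncate_loading_weights(w)
```

Because most elements behave like centred noise, the median and MAD are dominated by the noise part of the vector, so the two informative elements survive truncation while the noise band is zeroed out.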
2. Background

A basic result in statistics is the central limit theorem (CLT). The CLT was first presented by Abraham de Moivre in 1733 and has been formalized and interpreted under varying conditions and degrees of strictness ever since. A simple interpretation is that as the number of observations sampled from a random process increases, the distribution of their mean (and their sum) approaches a normal distribution. More interesting in this context is that many types of random noise are approximately normally distributed, and linear combinations of such noise will tend even more strongly towards the normal distribution. In this paper we propose to use the CLT to distinguish variables with expected non-zero loading weights from noisy variables whose loading weights have zero expectation. We refer to the new modeling principle as Truncation-PLS in the following, and the resulting methods

Chemometrics and Intelligent Laboratory Systems 122 (2013) 103–111. http://dx.doi.org/10.1016/j.chemolab.2013.01.008
* Corresponding author. Tel.: +47 64965830. E-mail address: kristian.liland@umb.no (K.H. Liland).