A comparison between subset selection and L1 regularisation with an application in spectroscopy

Yi Guo ⁎, Mark Berman

CSIRO Mathematics, Informatics & Statistics, Riverside Corporate Park, North Ryde 2113 NSW, Australia

Article info

Article history:
Received 17 May 2012
Received in revised form 27 August 2012
Accepted 28 August 2012
Available online 6 September 2012

Keywords:
Mixture model
Unmixing
Non-negative coefficients
Mahalanobis distance
Canonical variates

Abstract

A reflectance spectrum measures the reflectance of a material at hundreds or thousands of wavelengths. It provides chemical information about the material. Many rock samples actually contain a mixture of minerals. Because of the differing chemical compositions of the component minerals, the spectra of the rock sample often enable us to “unmix” the spectra to identify their mineral components. This is done with the aid of a fairly large library of pure spectra and a relatively simple linear mixture model, but with non-negativity constraints on some of the coefficients in the model. For many years, we have used full subset selection methods to identify the composition of millions of samples. There are several difficulties with this approach, in particular: (i) identifying the composition of large numbers of samples can be relatively slow, and (ii) estimating the number of components in a mixture is not as reliable as we would like, because both the deterministic and stochastic parts of our model are only approximations to reality, and hence classical statistical methods for deciding on the order of a linear model (e.g. F tests, AIC) do not work very well. Hence, ad hoc methods have had to be developed. In the hope of overcoming these difficulties, we have investigated the use of L1 regularisation as an alternative, because it is a convex optimisation problem and therefore there are efficient methods for finding the unique optimum.
Moreover, it is straightforward to carry out L1 regularisation incorporating non-negativity constraints on some of the coefficients. Unfortunately, L1 regularisation does not work as well as full subset selection does. We briefly discuss a possible hybrid approach.

Crown Copyright © 2012 Published by Elsevier B.V. All rights reserved.

1. Introduction

There are several different types of spectroscopy (e.g. reflectance, fluorescence, Raman). They provide different types of chemical information about materials. A reflectance spectrum measures the reflectance of a material at hundreds or thousands of wavelengths. The shortwave infrared (SWIR) region, covering the range (1300, 2500) nanometres (nm), is particularly useful for identifying many minerals important for exploration. (As a point of reference, visible light lies approximately in the range (400, 700) nm.)

A single spectrum integrates light over a pixel at each wavelength. In many applications, the spatial region covered by the pixel actually consists of several materials, so that the observed spectrum is a mixture of the spectra of the materials present in the pixel. This happens even in small pixels, such as the approximately 10 mm × 10 mm pixel shown in Fig. 1(a). (A colour version is shown in [4], Fig. 1.1(a).) The overlaid rectangle represents the approximate boundaries of the pixel over which a spectral measurement has been made. The different shades of grey within the pixel indicate that it contains several different minerals.

The spectrum corresponding to that pixel is shown in Fig. 1(b). It has a number of diagnostic “absorption features”. These are the intermediate frequency features which point down. An expert interpreter of geological spectra would identify this spectrum as representing a mixture of the following mineral groups: White Mica, Kaolin and Chlorite. Some interpreters might also suggest that a Carbonate is present; its presence is more subtle.
It is rare to see mixtures of four (apparent) minerals in a SWIR spectrum; spectra consisting of one, two or three minerals are far more common. Our expert interpreter collaborators have yet to identify one consisting of five or more minerals. Nevertheless, this is a useful example for our purposes, and we will discuss it in more detail later.

The spectrum shown in Fig. 1(b) is one of about 18,000 spectra measured down a relatively small core near East Ballarat, Victoria, Australia. Larger cores may require the measurement of over 100,000 spectra. The measurements are obtained using one of CSIRO's HyLogging™ suite of instruments – HyLogger™ and HyChips™ – (http://www.csiro.au/science/hylogging-systems.html), which have been developed over the last decade. These are now widely used around Australia for measuring the spectra of drill cores and drill chips.

The large volumes of spectra produced by these instruments mean that there is a need in the mining industry for software that analyses the spectra measured down individual cores. Speed and accuracy are essential requirements of such software. CSIRO has developed a package for analysing such data, called The Spectral Geologist™ (TSG), which is sold commercially (http://www.thespectralgeologist.com/).

Chemometrics and Intelligent Laboratory Systems 118 (2012) 127–138
⁎ Corresponding author. Tel.: +61 2 9325 3216; fax: +61 2 9325 3200.
E-mail addresses: yi.guo@csiro.au (Y. Guo), mark.berman@csiro.au (M. Berman).
http://dx.doi.org/10.1016/j.chemolab.2012.08.010
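
As an illustration of the two approaches contrasted in the abstract, the following minimal Python sketch unmixes a synthetic spectrum against a small library, once by exhaustive subset selection and once by non-negative L1 regularisation. Everything in it is an assumption made for illustration and is not taken from the paper or from TSG: the random library, the function names (`best_subset_nnls`, `nonneg_lasso`, `lasso_objective`), the penalty value `lam`, and the use of plain coordinate descent. For simplicity it constrains all coefficients to be non-negative, whereas the model described above constrains only some of them, and it takes the number of components as known.

```python
import itertools

import numpy as np
from scipy.optimize import nnls


def best_subset_nnls(library, spectrum, size):
    """Fit every subset of `size` library spectra by non-negative least squares.

    Returns (indices, coefficients, residual_norm) for the subset with the
    smallest residual norm. The cost grows combinatorially with library size.
    """
    best = (None, None, np.inf)
    for idx in itertools.combinations(range(library.shape[1]), size):
        coef, rnorm = nnls(library[:, list(idx)], spectrum)
        if rnorm < best[2]:
            best = (idx, coef, rnorm)
    return best


def nonneg_lasso(library, spectrum, lam, n_iter=500):
    """Coordinate descent for 0.5*||y - X b||^2 + lam*sum(b), subject to b >= 0."""
    n_lib = library.shape[1]
    coef = np.zeros(n_lib)
    col_sq = np.sum(library ** 2, axis=0)
    residual = spectrum.astype(float).copy()  # equals y - X b for the current b
    for _ in range(n_iter):
        for j in range(n_lib):
            # Inner product of column j with the residual that excludes column j.
            rho = library[:, j] @ residual + col_sq[j] * coef[j]
            new = max(0.0, (rho - lam) / col_sq[j])
            residual += library[:, j] * (coef[j] - new)
            coef[j] = new
    return coef


def lasso_objective(library, spectrum, coef, lam):
    """Penalised least-squares objective minimised by nonneg_lasso."""
    return 0.5 * np.sum((spectrum - library @ coef) ** 2) + lam * np.sum(coef)


# Hypothetical data: 50 wavelengths, a library of 6 "pure" spectra, and a
# noise-free mixture of library members 1 and 4.
rng = np.random.default_rng(0)
library = rng.random((50, 6))
spectrum = 0.7 * library[:, 1] + 0.3 * library[:, 4]

idx, coef, rnorm = best_subset_nnls(library, spectrum, size=2)
l1_coef = nonneg_lasso(library, spectrum, lam=0.05)

print("subset selection:", idx, np.round(coef, 3))
print("non-negative L1: ", np.round(l1_coef, 3))
```

The subset search fits one small non-negative least-squares problem per candidate subset, which is the source of the speed concern for large libraries; the L1 version solves a single convex problem, but its coefficients are shrunk by the penalty and its selected support depends on `lam`.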