Software Description

A MATLAB toolbox for Principal Component Analysis and unsupervised exploration of data structure

Davide Ballabio

Milano Chemometrics and QSAR Research Group, Department of Earth and Environmental Sciences, University of Milano-Bicocca, Milano, Italy

Article info

Article history: Received 8 September 2015. Received in revised form 16 September 2015. Accepted 3 October 2015. Available online 19 October 2015.

Keywords: Principal Component Analysis; Rank analysis; Cluster Analysis; Multidimensional Scaling; MATLAB

Abstract

Principal Component Analysis is a multivariate method that projects data into a reduced hyperspace defined by orthogonal principal components, which are linear combinations of the original variables. In this way, data dimension can be reduced and noise can be excluded from the subsequent analysis, which greatly facilitates data interpretation. For these reasons, Principal Component Analysis is nowadays the most common chemometric strategy for unsupervised exploratory data analysis.

In this paper, the PCA toolbox for MATLAB is described. This is a collection of modules for calculating Principal Component Analysis, as well as Cluster Analysis and Multidimensional Scaling, which are two other well-known multivariate methods for unsupervised data exploration. The toolbox is freely available via the Internet and comprises a graphical user interface (GUI), which allows calculations to be carried out in an easy-to-use graphical environment. It aims to be useful for both beginners and advanced users. The use of the toolbox is discussed here with a practical example.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Principal Component Analysis (PCA) is a well-known chemometric technique for exploratory data analysis; it projects data into a reduced hyperspace defined by orthogonal principal components [1,2].
These are linear combinations of the original variables, with the first principal component having the largest variance, the second principal component having the second largest variance, and so on. It is thus possible to select a number of significant components, so that data dimension is reduced while preserving the systematic variation in the data, which is retained in the first selected components; noise is excluded, being represented in the last components. Therefore, PCA enhances and facilitates data exploration and interpretation of multivariate datasets.

In addition to PCA, two other common chemometric strategies for unsupervised data analysis are Cluster Analysis and Multidimensional Scaling. Cluster Analysis differs from PCA in that its goal is to detect similarities between samples and define groups in the data [3], while Multidimensional Scaling (MDS) takes into account the mutual relationships of sample distances to reproduce the data structure encoded in the distance (similarity) matrix in a low-dimensional space [4].

This work presents the PCA toolbox for MATLAB, a collection of MATLAB modules freely available via the Internet from the Milano Chemometrics and QSAR Research Group website [5]. The toolbox was developed to calculate PCA, Cluster Analysis, and MDS in an easy-to-use graphical user interface (GUI) environment. It does not require an experienced user, but a basic knowledge of the underlying methods is necessary to interpret the results correctly.

In addition to the usual outputs, the PCA toolbox for MATLAB provides comprehensive PCA results, as well as different methods to estimate the optimal number of significant components. The originality of this manuscript therefore lies not in the methods implemented in the toolbox but in the fact that the entire workflow can be carried out by means of a graphical user interface (GUI).
There is no need to type instructions at the MATLAB command line, and all steps of the analysis (data loading, univariate data screening, component selection, model calculation, model analysis, projection of new samples) can be handled in the GUI. This is an important aspect, since toolboxes and software usually lack a graphical interface, which can discourage users (especially beginners in chemometrics or MATLAB) from using them, even if the underlying model is a basic multivariate method. Moreover, some already available toolboxes are black boxes without any apparent detailed description of the options related to the underlying models, while comprehensive help is provided with the PCA toolbox for MATLAB, describing the theory, options, and examples of the calculated models.

In the first part of the paper, the theory of the methods included in the toolbox is briefly reviewed. Then, the MATLAB modules and their features are described, and finally, the results obtained on a real chemical dataset are shown as an example of application.

Chemometrics and Intelligent Laboratory Systems 149 (2015) 1–9
Author address: Dept. of Earth and Environmental Sciences, University of Milano-Bicocca, P.zza della Scienza 1, 20126 Milano, Italy. Tel.: +39 02 64482818. E-mail address: davide.ballabio@unimib.it.
http://dx.doi.org/10.1016/j.chemolab.2015.10.003
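The Introduction describes principal components as orthogonal linear combinations of the original variables, ordered by decreasing variance. A minimal sketch of that idea is given below. It is written in Python with the standard library only and is not taken from the toolbox: the function names (`center`, `covariance`, `leading_eigenpair`, `pca`), the small dataset, and the choice of power iteration with deflation are all assumptions made for illustration. The algorithm actually used by the PCA toolbox for MATLAB is not stated in this section.

```python
import math
import random


def center(data):
    """Subtract each column mean so every variable has zero mean."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]


def covariance(centered):
    """Sample covariance matrix (variables x variables) of centered data."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]


def leading_eigenpair(matrix, iterations=1000, seed=0):
    """Power iteration: dominant eigenvalue and unit eigenvector of a
    symmetric matrix (illustrative choice, not the toolbox's method)."""
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in matrix]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # zero matrix: no variance left to extract
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient v' M v gives the eigenvalue for the unit vector v.
    value = sum(v * sum(m * w for m, w in zip(row, vec))
                for v, row in zip(vec, matrix))
    return value, vec


def pca(data, n_components):
    """Return loadings, explained-variance ratios, and sample scores."""
    centered = center(data)
    cov = covariance(centered)
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    loadings, explained = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        loadings.append(vec)
        explained.append(value / total_variance)
        # Deflation: remove the variance already captured by this component.
        cov = [[c - value * vi * vj for c, vj in zip(row, vec)]
               for row, vi in zip(cov, vec)]
    # Scores: projection of each centered sample onto the loadings.
    scores = [[sum(x * w for x, w in zip(row, vec)) for vec in loadings]
              for row in centered]
    return loadings, explained, scores


# Hypothetical dataset: 8 samples (rows) described by 3 variables (columns).
data = [
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.9],
    [2.2, 2.9, 0.8],
    [1.9, 2.2, 1.1],
    [3.1, 3.0, 0.4],
    [2.3, 2.7, 0.9],
    [2.0, 1.6, 1.2],
    [1.0, 1.1, 1.6],
]

loadings, explained, scores = pca(data, n_components=2)
for k, ratio in enumerate(explained, start=1):
    print(f"PC{k}: {100 * ratio:.1f}% of total variance")
```

Because the three variables in this made-up dataset are strongly correlated, the first component captures most of the total variance, which is the situation in which keeping only the first few components reduces dimension with little loss of systematic information.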