Software Description
A MATLAB toolbox for Principal Component Analysis and unsupervised
exploration of data structure
Davide Ballabio ⁎
Milano Chemometrics and QSAR Research Group, Department of Earth and Environmental Sciences, University of Milano–Bicocca, Milano, Italy
Article info
Article history:
Received 8 September 2015
Received in revised form 16 September 2015
Accepted 3 October 2015
Available online 19 October 2015
Keywords:
Principal Component Analysis
Rank analysis
Cluster Analysis
Multidimensional Scaling
MATLAB
Abstract
Principal Component Analysis is a multivariate method that projects data into a reduced hyperspace defined by
orthogonal principal components, which are linear combinations of the original variables. In this way, data
dimensionality can be reduced and noise can be excluded from the subsequent analysis, which greatly facilitates
data interpretation. For these reasons, Principal Component Analysis is nowadays the most common chemometric
strategy for unsupervised exploratory data analysis.
In this paper, the PCA toolbox for MATLAB is described. It is a collection of modules for calculating Principal
Component Analysis, as well as Cluster Analysis and Multidimensional Scaling, two other well-known
multivariate methods for unsupervised data exploration. The toolbox is freely available via the Internet and
includes a graphical user interface (GUI), so that calculations can be carried out in an easy-to-use graphical
environment. It aims to be useful for both beginners and advanced users. The use of the toolbox is illustrated
here with a practical example.
© 2015 Elsevier B.V. All rights reserved.
1. Introduction
Principal Component Analysis (PCA) is a well-known chemometric
technique for exploratory data analysis; it projects data into a
reduced hyperspace defined by orthogonal principal components [1,2].
These are linear combinations of the original variables, with the first
principal component having the largest variance, the second principal
component having the second largest variance, and so on. It is thus
possible to select a number of significant components, so that data
dimensionality is reduced while the systematic variation in the data is
preserved in the first selected components, and noise, which is
represented in the last components, is excluded. PCA therefore enhances
and facilitates the exploration and interpretation of multivariate
datasets.
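The projection described above can be sketched in a few lines of base MATLAB. This is an illustrative sketch only, not the toolbox's own code: the small data matrix, the variable names (X, T, P, expl, k), the choice of mean-centering without scaling, and the 95% cumulative-variance cut-off are all assumptions introduced here.

```matlab
% Minimal PCA sketch via singular value decomposition (base MATLAB only).
% All data and names below are hypothetical and chosen for illustration.
X = [2.5 2.4 0.5;
     0.5 0.7 1.9;
     2.2 2.9 0.8;
     1.9 2.2 1.1;
     3.1 3.0 0.4;
     2.3 2.7 0.9;
     2.0 1.6 1.2;
     1.0 1.1 1.8];                      % 8 samples (rows) x 3 variables (columns)

Xc = bsxfun(@minus, X, mean(X, 1));     % mean-centre each variable (assumed pretreatment)
[U, S, V] = svd(Xc, 'econ');            % Xc = U * S * V'

T  = U * S;                             % scores: sample coordinates on the components
P  = V;                                 % loadings: orthonormal variable weights
ev = diag(S).^2 / (size(X, 1) - 1);     % variance carried by each component (sorted, largest first)
expl = ev / sum(ev);                    % fraction of total variance per component

% Keep the smallest number of components reaching an arbitrary 95% threshold.
k  = find(cumsum(expl) >= 0.95, 1);
Tk = T(:, 1:k);                         % reduced representation of the samples
```

The ordering of `ev` mirrors the statement that the first component has the largest variance, the second the next largest, and so on. A fixed cumulative-variance threshold is only one of several possible criteria for choosing the number of components.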
In addition to PCA, two other common chemometric strategies for
unsupervised data analysis are Cluster Analysis and Multidimensional
Scaling. Cluster Analysis differs from PCA in that its goal is to detect
similarities between samples and define groups in the data [3], while
Multidimensional Scaling (MDS) uses the mutual relationships of sample
distances to reproduce the data structure encoded in the distance
(similarity) matrix in a low-dimensional space [4].
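As a rough illustration of how a distance matrix can be turned back into low-dimensional coordinates, the following sketch applies classical (metric) MDS using base MATLAB only. It assumes Euclidean distances and the classical eigendecomposition variant; the text above does not state which MDS variant the toolbox uses, and the sample coordinates and variable names are hypothetical.

```matlab
% Classical MDS sketch: recover coordinates from squared Euclidean distances.
% Data and names are hypothetical; this is not the toolbox's implementation.
X = [0 0; 3 0; 3 4; 0 4; 1.5 2];        % 5 samples in 2 dimensions
n = size(X, 1);

% Squared pairwise Euclidean distances, computed without toolbox functions.
G  = X * X';
sq = diag(G);
D2 = max(bsxfun(@plus, sq, sq') - 2 * G, 0);

% Double-centre the squared distances to obtain an inner-product matrix.
J = eye(n) - ones(n) / n;               % centring matrix
B = -0.5 * J * D2 * J;
B = (B + B') / 2;                       % remove round-off asymmetry

% Eigendecomposition, sorted so the largest eigenvalues come first.
[V, L] = eig(B);
[lambda, idx] = sort(diag(L), 'descend');
V = V(:, idx);

k = 2;                                  % target dimensionality (assumed)
Y = V(:, 1:k) * diag(sqrt(max(lambda(1:k), 0)));   % low-dimensional coordinates

% Distances between the recovered coordinates, for comparison with D2.
GY = Y * Y';
sy = diag(GY);
D2hat = max(bsxfun(@plus, sy, sy') - 2 * GY, 0);
err = max(abs(D2hat(:) - D2(:)));       % largest discrepancy in squared distance
```

Because the hypothetical samples are two-dimensional to begin with, two coordinates are enough to reproduce the distances up to round-off; the recovered configuration may still be rotated or reflected relative to the original one.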
This work presents the PCA toolbox for MATLAB, a collection of
MATLAB modules freely available via the Internet from the Milano
Chemometrics and QSAR Research Group website [5]. The toolbox was
developed to calculate PCA, Cluster Analysis, and MDS in an easy-to-use
graphical user interface (GUI) environment. It does not require an
experienced user, but a basic knowledge of the underlying methods is
necessary to interpret the results correctly.
The PCA toolbox for MATLAB provides comprehensive PCA results beyond
the usual outputs, as well as different methods to estimate the
optimal number of significant components. The originality of this
manuscript does not lie in the methods implemented in the toolbox, but
in the fact that the entire workflow can be carried out through the
GUI. There is no need to type instructions at the MATLAB command line,
and all steps of the analysis (data loading, univariate data screening,
component selection, model calculation, model analysis, projection of
new samples) can be handled in the GUI. This is an important aspect,
since toolboxes and software usually lack a graphical interface, which
can discourage users (especially beginners in chemometrics or MATLAB)
from using them, even if the underlying model is a basic multivariate
method. Moreover, some already available toolboxes are black boxes with
no apparent detailed description of the options related to the
underlying models, whereas the PCA toolbox for MATLAB comes with
comprehensive help describing the theory, the options, and examples of
the calculated models.
In the first part of the paper, the theory of the methods included in
the toolbox is briefly reviewed. Then, the MATLAB modules and their
features are described. Finally, the results obtained on a real chemical
dataset are shown as an example of application.
Chemometrics and Intelligent Laboratory Systems 149 (2015) 1–9
⁎ Dept. of Earth and Environmental Sciences, University of Milano–Bicocca, P.zza della
Scienza, 1–20126 Milano, Italy. Tel.: +39 02 64482818.
E-mail address: davide.ballabio@unimib.it.
http://dx.doi.org/10.1016/j.chemolab.2015.10.003
0169-7439/© 2015 Elsevier B.V. All rights reserved.