Partial Least Square Discriminant Analysis for bankruptcy prediction
Carlos Serrano-Cinca ⁎, Begoña Gutiérrez-Nieto
Department of Accounting and Finance, University of Zaragoza, Spain
Article info
Article history:
Received 29 June 2011
Received in revised form 13 September 2012
Accepted 25 November 2012
Available online 3 December 2012
Keywords:
Bankruptcy
Financial ratios
Banking crisis
Solvency
Data mining
PLS-DA
Abstract
This paper uses Partial Least Squares Discriminant Analysis (PLS-DA) to predict the 2008 USA banking
crisis. PLS regression transforms a set of correlated explanatory variables into a new set of uncorrelated variables,
which is appropriate in the presence of multicollinearity. PLS-DA performs a PLS regression with a dichotomous
dependent variable. The performance of this technique is compared with that of eight algorithms widely
used in bankruptcy prediction. In terms of accuracy, precision, F-score, Type I error and Type II error, results
are similar; no algorithm outperforms the others. Beyond performance, each algorithm assigns a score to each
bank and classifies it as solvent or failed. These results have been analyzed by means of contingency tables,
correlations, cluster analysis and dimensionality reduction techniques. PLS-DA results are very close to those
obtained by Linear Discriminant Analysis and Support Vector Machine.
© 2012 Elsevier B.V. All rights reserved.
1. Introduction
Bankruptcy prediction from financial ratios using mathematical
models is a classical approach in data mining research. Since Beaver's
[7] pioneering work, based on univariate ratio analysis, many different
techniques have been employed in this context. Altman [3] used Linear
Discriminant Analysis (LDA); Ohlson [42] used Logistic Regression (LR);
Marais et al. [35] used Decision Trees such as ID3, C4.5 and Random
Trees; Tam and Kiang [57] used Multilayer Perceptron (MLP), a neural
network model, and K-Nearest Neighbors (KNN); Serrano-Cinca [54]
and du Jardin and Séverin [18] applied Self Organizing Feature Maps;
Fan and Palaniswami [21] used Support Vector Machine (SVM); and
Sarkar and Sriram [52] applied Naive Bayes (NB). Ensemble techniques,
such as Boosting or Bagging, have been applied by Foster and Stine
[26], who combined C4.5 and Boosting, and by Mukkamala et al. [39],
who combined Bagging and Random Tree (BRT). See Olson et al. [44] for a
recent comparative analysis of data mining methods for bankruptcy
prediction. This paper applies Partial Least Squares Discriminant Analysis
(PLS-DA) to the 2008 banking crisis in the USA. To the best of our
knowledge, this technique has not previously been applied to
bankruptcy prediction.
Partial Least Squares (PLS) regression combines features from
Principal Component Analysis (PCA) and Multiple Linear Regression [60].
PLS is a mathematical estimation approach that builds a model by
sequentially adding data points so that model parameters are
continuously updated. PLS models are popular in structural model building
and in regression analysis. PLS-DA is based on the PLS regression
model, with a categorical dependent variable. This approach
is useful for classification tasks [6]. For example, PLS-DA is a standard
tool in Chemometrics, the science that analyzes chemical data [62]. Its
appeal lies in its ability to successfully address the problem of
multicollinearity [6,59].
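As a rough illustration of the idea described above, the following sketch fits a one-response PLS regression on a 0/1 class label and classifies each case by comparing its predicted score with a cut-off. It is not the authors' implementation. The NIPALS-style fitting routine, the toy ratios, the coding (1 = failed, 0 = solvent), the 0.5 cut-off and the names fit_pls_da, score and classify are all assumptions made for illustration; predictors are centred but not scaled to keep the example short.

```python
from math import sqrt


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def fit_pls_da(X, y, n_components=2):
    """Fit a one-response PLS regression (NIPALS-style) on a 0/1 class label."""
    n = len(X)
    x_mean = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    E = [[v - m for v, m in zip(row, x_mean)] for row in X]  # centred predictors
    f = [v - y_mean for v in y]                               # centred label
    weights, loadings, coefs, scores = [], [], [], []
    for _ in range(n_components):
        w = [dot(col, f) for col in zip(*E)]  # direction of maximal covariance with y
        norm = sqrt(dot(w, w))
        if norm < 1e-12:                      # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]        # latent score of each observation
        tt = dot(t, t)
        p = [dot(col, t) / tt for col in zip(*E)]  # predictor loadings
        q = dot(f, t) / tt                         # regression of the label on the score
        # Deflate: remove what this component explains before extracting the next one.
        E = [[v - ti * pj for v, pj in zip(row, p)] for row, ti in zip(E, t)]
        f = [v - q * ti for v, ti in zip(f, t)]
        weights.append(w)
        loadings.append(p)
        coefs.append(q)
        scores.append(t)
    return {"x_mean": x_mean, "y_mean": y_mean, "weights": weights,
            "loadings": loadings, "coefs": coefs, "scores": scores}


def score(model, x):
    """Predicted value of the 0/1 label for one new observation."""
    e = [v - m for v, m in zip(x, model["x_mean"])]
    y_hat = model["y_mean"]
    for w, p, q in zip(model["weights"], model["loadings"], model["coefs"]):
        t = dot(e, w)
        y_hat += q * t
        e = [v - t * pj for v, pj in zip(e, p)]  # same deflation as in fitting
    return y_hat


def classify(model, x, cutoff=0.5):
    """Assumed decision rule: 1 (failed) if the score exceeds the cut-off."""
    return 1 if score(model, x) > cutoff else 0


# Hypothetical, strongly correlated ratios for eight banks (invented toy data).
X = [
    [0.10, 0.11, 0.90], [0.09, 0.10, 0.85], [0.11, 0.12, 0.95], [0.10, 0.10, 0.90],
    [0.01, 0.02, 0.20], [0.02, 0.02, 0.25], [0.00, 0.01, 0.15], [0.01, 0.01, 0.20],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # assumed coding: 1 = failed, 0 = solvent

model = fit_pls_da(X, y, n_components=2)
predicted = [classify(model, row) for row in X]
print("predicted classes:", predicted)
print("inner product of the two score vectors:",
      dot(model["scores"][0], model["scores"][1]))
```

Although the three input ratios are strongly correlated, the two score vectors extracted by the routine are orthogonal up to floating-point error, which is the property that makes the approach attractive under multicollinearity.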
Multicollinearity is a major problem when building models based on
financial data. Financial analysts require tools that can accurately predict
distress from financial data. But they also want to model bankruptcy
symptoms by identifying the relevant variables. However, it is difficult
to select an appropriate model when using collinear data, as there is no
unique data reduction method, and different orderings of the
hypothesis testing procedure result in different models, something that affects
interpretability. Multicollinearity and model selection procedures in
regression have long been debated in Econometrics; see for example
Hendry and Mizon [28]. Only when regressors are orthogonal does any
model selection procedure end up identifying the same model.
This is certainly not the case with financial data. A possible solution is
to do regression analysis on principal components, which are
orthogonal by definition, but this always results in a loss of information in the
data set since only a small number of components are employed in
the distress prediction model. In this paper we consider whether the PLS-DA
methodology offers a possible way forward in this context. The paper
poses two research questions: RQ1: How does PLS-DA perform in
terms of model interpretability, when facing correlated data, compared
to other techniques? The case of multicollinearity in financial
information receives special attention. RQ2: How does PLS-DA perform in
terms of classification accuracy, compared to other techniques?
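The performance measures named in the abstract (accuracy, precision, F-score, Type I error and Type II error) can all be derived from a two-by-two confusion matrix. The sketch below shows one way to compute them. It assumes that "failed" is the positive class, and that a Type I error is a failed bank classified as solvent while a Type II error is a solvent bank classified as failed; this convention is common in the bankruptcy literature but is not spelled out in this section. The function name classification_metrics and the example labels are hypothetical.

```python
def classification_metrics(actual, predicted, positive=1):
    """Confusion-matrix measures for a two-class problem (positive = failed)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "f_score": f_score,
        # Assumed convention: Type I = failed bank classified as solvent.
        "type_i_error": fn / (tp + fn) if tp + fn else 0.0,
        # Assumed convention: Type II = solvent bank classified as failed.
        "type_ii_error": fp / (fp + tn) if fp + tn else 0.0,
    }


# Invented labels for ten banks: 1 = failed, 0 = solvent.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting the two error types separately matters in this setting because the cost of missing a failing bank is usually not the same as the cost of flagging a solvent one.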
Decision Support Systems 54 (2013) 1245–1255
⁎ Corresponding author at: Department of Accounting and Finance, Fac. Economía y
Empresa, Univ. Zaragoza, Gran Vía 2, Zaragoza (50.005), Spain. Tel.: +34 876 554643;
fax: +34 976 761769.
E-mail address: serrano@unizar.es (C. Serrano-Cinca).
URL: http://ciberconta.unizar.es/charles.htm (C. Serrano-Cinca).
0167-9236/$ – see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.dss.2012.11.015