Partial Least Square Discriminant Analysis for bankruptcy prediction

Carlos Serrano-Cinca, Begoña Gutiérrez-Nieto
Department of Accounting and Finance, University of Zaragoza, Spain

Article history: Received 29 June 2011; Received in revised form 13 September 2012; Accepted 25 November 2012; Available online 3 December 2012

Keywords: Bankruptcy; Financial ratios; Banking crisis; Solvency; Data mining; PLS-DA

Abstract

This paper uses Partial Least Square Discriminant Analysis (PLS-DA) for the prediction of the 2008 USA banking crisis. PLS regression transforms a set of correlated explanatory variables into a new set of uncorrelated variables, which is appropriate in the presence of multicollinearity. PLS-DA performs a PLS regression with a dichotomous dependent variable. The performance of this technique is compared with that of 8 algorithms widely used in bankruptcy prediction. In terms of accuracy, precision, F-score, Type I error and Type II error, results are similar; no algorithm outperforms the others. Beyond performance, each algorithm assigns a score to each bank and classifies it as solvent or failed. These results have been analyzed by means of contingency tables, correlations, cluster analysis and dimensionality reduction techniques. PLS-DA results are very close to those obtained by Linear Discriminant Analysis and Support Vector Machine.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Bankruptcy prediction from financial ratios using mathematical models is a classical approach in data mining research. Since Beaver's [7] pioneering work, based on univariate ratio analysis, many different techniques have been employed in this context. Altman [3] used Linear Discriminant Analysis (LDA); Ohlson [42] used Logistic Regression (LR); Marais et al.
[35] used Decision Trees such as Id3, C4.5 and Random Trees; Tam and Kiang [57] used the Multilayer Perceptron (MLP), a neural network model, and K-Nearest Neighbors (KNN); Serrano-Cinca [54] and du Jardin and Séverin [18] applied Self Organizing Feature Maps; Fan and Palaniswami [21] used Support Vector Machine (SVM); and Sarkar and Sriram [52] applied Naive Bayes (NB). Ensemble techniques, such as Boosting or Bagging, have been applied by Foster and Stine [26], who combined C4.5 and Boosting, while Mukkamala et al. [39] combined Bagging and Random Tree (BRT). See Olson et al. [44] for a recent comparative analysis of data mining methods for bankruptcy prediction. This paper applies Partial Least Square Discriminant Analysis (PLS-DA) to the 2008 banking crisis in the USA. To the best of our knowledge, this technique has not previously been applied to bankruptcy prediction.

Partial Least Squares (PLS) regression combines features from Principal Component Analysis (PCA) and Multiple Linear Regression [60]. PLS is a mathematical estimation approach that builds a model by sequentially adding data points so that model parameters are continuously updated. PLS models are popular in structural model building and in regression analysis. PLS-DA is based on the PLS regression model, with a categorical dependent variable. This approach is useful for classification tasks [6]. For example, PLS-DA is a standard tool in Chemometrics, the science that analyzes chemical data [62]. Its attraction resides in its ability to successfully address the problem of multicollinearity [6,59].

Multicollinearity is a major problem when building models based on financial data. Financial analysts demand tools that can accurately predict distress from financial data, but they also want to model bankruptcy symptoms by identifying the relevant variables.
However, it is difficult to select an appropriate model when using collinear data, as there is no unique data reduction method, and different orderings of the hypothesis testing procedure result in different models, something that affects interpretability. Multicollinearity and model selection procedures in regression have long been debated in Econometrics; see for example Hendry and Mizon [28]. Only when the regressors are orthogonal does every model selection procedure end up identifying the same model. This is certainly not the case with financial data. A possible solution is to do regression analysis on principal components, which are orthogonal by definition, but this always results in a loss of information in the data set, since only a small number of components are employed in the distress prediction model. In this paper we consider whether the PLS-DA methodology offers a possible way forward in this context.

The paper poses two research questions:

RQ1: How does PLS-DA perform in terms of model interpretability, when facing correlated data, compared to other techniques? The case of multicollinearity in financial information will receive special attention.

RQ2: How does PLS-DA perform in terms of classification accuracy, compared to other techniques?

Decision Support Systems 54 (2013) 1245–1255. Journal homepage: www.elsevier.com/locate/dss. http://dx.doi.org/10.1016/j.dss.2012.11.015

Corresponding author at: Department of Accounting and Finance, Fac. Economía y Empresa, Univ. Zaragoza, Gran Vía 2, Zaragoza (50.005), Spain. Tel.: +34 876 554643; fax: +34 976 761769. E-mail address: serrano@unizar.es (C. Serrano-Cinca). URL: http://ciberconta.unizar.es/charles.htm (C. Serrano-Cinca).
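As an illustration of the idea described in the Introduction (a PLS regression on a dichotomous dependent variable that replaces correlated predictors with uncorrelated components), the following is a minimal sketch and not the authors' implementation. It assumes a single-response PLS fitted with a NIPALS-style deflation; the function names (`pls1_fit`, `center`, `dot`), the synthetic "ratios" `x1`, `x2`, `x3`, the coding 1 = failed and 0 = solvent, and the 0.5 classification cutoff are all hypothetical choices made only for this example.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def center(col):
    """Return the mean-centered column and its mean."""
    m = sum(col) / len(col)
    return [x - m for x in col], m

def pls1_fit(columns, y, n_components):
    """Single-response PLS with NIPALS-style deflation (illustrative sketch).

    columns: list of predictor columns; y: list of 0/1 labels.
    Returns the component score vectors and the fitted values of y.
    """
    X = [center(c)[0] for c in columns]
    r, y_mean = center(y)              # r holds the current y residual
    fitted = [y_mean] * len(y)
    scores = []
    for _ in range(n_components):
        # Weights: direction of maximal covariance between X and y
        w = [dot(c, r) for c in X]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        w = [wj / norm for wj in w]
        # Score vector t = X w (one value per observation)
        t = [sum(wj * c[i] for wj, c in zip(w, X)) for i in range(len(y))]
        tt = dot(t, t)
        p = [dot(c, t) / tt for c in X]    # X loadings
        q = dot(r, t) / tt                 # y loading
        # Deflate X and y so that the next score is orthogonal to t
        X = [[x - ti * pj for x, ti in zip(c, t)] for c, pj in zip(X, p)]
        r = [ri - q * ti for ri, ti in zip(r, t)]
        fitted = [f + q * ti for f, ti in zip(fitted, t)]
        scores.append(t)
    return scores, fitted

# Synthetic data (assumption): x1 and x2 are almost collinear, x3 is unrelated.
n = 20
y = [1.0 if i < n // 2 else 0.0 for i in range(n)]   # 1 = failed, 0 = solvent
z = [(2.0 if yi == 1.0 else -2.0) + 0.3 * math.sin(1.7 * i)
     for i, yi in enumerate(y)]
x1 = z
x2 = [2.0 * zi + 0.1 * math.cos(2.3 * i) for i, zi in enumerate(z)]
x3 = [math.sin(0.9 * i) for i in range(n)]

scores, fitted = pls1_fit([x1, x2, x3], y, n_components=2)

# Assumed decision rule: classify as failed when the fitted value reaches 0.5
predicted = [1.0 if f >= 0.5 else 0.0 for f in fitted]
accuracy = sum(p == t for p, t in zip(predicted, y)) / n

print("t1 . t2 =", dot(scores[0], scores[1]))   # close to zero: uncorrelated components
print("accuracy =", accuracy)
```

In this sketch the two score vectors are orthogonal by construction, because each deflation step removes from the predictors everything explained by the previous score; this is the property that makes the approach attractive when the original ratios are highly correlated. The accuracy reported here is in-sample on toy data and says nothing about the results of the paper.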