On the impact of disproportional samples in credit scoring models: An application to a Brazilian bank data

Francisco Louzada a,*, Paulo H. Ferreira-Silva b, Carlos A.R. Diniz b

a Universidade de São Paulo, SME-ICMC, São Carlos, Brazil
b Universidade Federal de São Carlos, DEs, São Carlos, Brazil

Keywords: Classification models; Naive logistic regression; Logistic regression with state-dependent sample selection; Performance measures; Credit scoring

Abstract

Statistical methods have been widely employed to assess the capabilities of credit scoring classification models in order to reduce the risk of wrong decisions when granting credit facilities to clients. The predictive quality of a classification model can be evaluated based on measures such as sensitivity, specificity, predictive values, accuracy, correlation coefficients and information theoretical measures, such as relative entropy and mutual information. In this paper we analyze the performance of a naive logistic regression model (Hosmer & Lemeshow, 1989) and a logistic regression with state-dependent sample selection model (Cramer, 2004) applied to simulated data. Also, as a case study, the methodology is illustrated on a data set extracted from a Brazilian bank portfolio. Our simulation results so far reveal no statistically significant difference in predictive capacity between the naive logistic regression models and the logistic regression with state-dependent sample selection models. However, there is a strong difference between the distributions of the default probabilities estimated by these two statistical modeling techniques, with the naive logistic regression models always underestimating such probabilities, particularly in the presence of balanced samples.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

The proper classification of applicants is of vital importance for determining the granting of credit facilities.
Historically, statistical classification models have been used by financial institutions as a major tool to support decisions on granting credit to clients. The use of classification models was consolidated in the 1990s, when changes in the world scene, such as the deregulation of interest rates and exchange rates and increases in liquidity and bank competition, made financial institutions increasingly concerned about credit risk, i.e., the risk they run when accepting someone as a client. The granting of credit became more important to the profitability of companies in the financial sector, turning into one of the main sources of revenue for banks and financial institutions in general. As a result, this sector of the economy realized that it was highly advisable to increase the amount of allocated resources without losing agility or credit quality, and this is where the contribution of statistical modeling is essential.

Classification models for credit scoring are based on databases of relevant client information, in which the financial performance of clients, evaluated from the time when the client–company relationship began, is treated as a dichotomous classification. The goal of credit scoring models is to classify loan clients as either good credit or bad credit (Lee, Chiu, Lu, & Chen, 2002), predicting the bad payers (Lim & Sohn, 2007). In this context, discriminant analysis, regression trees, logistic regression, logistic regression with state-dependent sample selection and neural networks are among the most widely used classification models. In fact, logistic regression is still widely used in building and developing credit scoring models (Caouette, Altman, & Narayanan, 1998; Desai, Crook, & Overstreet, 1996; Hand & Henley, 1997; Sarlija, Bensic, & Bohacek, 2004). Generally, no single technique is best for all data sets, but a set of methods can be compared using statistical criteria.
Therefore, the main aim of this paper is to investigate and compare, through a simulation study, the performance of the naive logistic regression (Hosmer & Lemeshow, 1989) and the logistic regression with state-dependent sample selection (Cramer, 2004) using performance measures. The idea is to analyze the impact of disproportional samples on credit scoring models. Logistic regression with state-dependent sample selection is a statistical modeling technique for cases where the sample used to develop a model, i.e., the selected sample, contains only a portion, usually small, of the individuals who make up one of the two study groups, in general the more frequent group. In credit scoring, for instance, the group of good payers is expected to be the predominant group. In short, this recent technique corrects the default probability estimated by a naive logistic regression model (Cramer, 2004).

* Corresponding author. Tel.: +55 16 3373 6614. E-mail address: louzada@icmc.usp.br (F. Louzada).
Expert Systems with Applications 39 (2012) 8071–8078
0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2012.01.134
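
The correction mentioned at the end of the introduction can be illustrated with a minimal sketch. It is not taken from the paper: it uses the standard odds rescaling for samples selected by outcome, and the function name `correct_probability` and the retention-rate arguments `keep_rate_1` and `keep_rate_0` are hypothetical. It assumes each class is retained in the selected sample with a known, covariate-independent probability. The direction of the adjustment depends on which class is thinned and which class is coded 1, so the numbers below make no claim about the paper's own setup.

```python
def correct_probability(p_sample: float, keep_rate_1: float, keep_rate_0: float) -> float:
    """Map a probability fitted on a selected sample back to the full population.

    Assumption (not from the paper): class 1 cases enter the selected sample
    with probability keep_rate_1 and class 0 cases with probability keep_rate_0,
    independently of the covariates. Then
        sample odds = population odds * keep_rate_1 / keep_rate_0,
    and inverting this relation gives the expression returned below.
    """
    if not 0.0 <= p_sample <= 1.0:
        raise ValueError("p_sample must lie in [0, 1]")
    if not (0.0 < keep_rate_1 <= 1.0 and 0.0 < keep_rate_0 <= 1.0):
        raise ValueError("retention rates must lie in (0, 1]")
    numerator = p_sample * keep_rate_0
    denominator = numerator + (1.0 - p_sample) * keep_rate_1
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical numbers: class 1 kept in full, class 0 kept at a 10% rate.
    naive = 0.5
    print(correct_probability(naive, keep_rate_1=1.0, keep_rate_0=0.1))
```

When both retention rates are equal the function returns its input unchanged, which matches the intuition that a proportional sample needs no correction.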