JIEMS
Journal of Industrial Engineering and Management Studies
Vol. 5, No. 2, 2018, pp. 1-12
DOI: 10.22116/JIEMS.2018.80681
www.jiems.icms.ac.ir

Investigating the missing data effect on credit scoring rule based models: The case of an Iranian bank

Seyed Mahdi Sadatrasoul 1,*, Zeynab Hajimohammadi 2

Abstract

Credit risk management is a process in which banks estimate the probability of default (PD) for each loan applicant. Data sets of previous loan applicants are built by gathering their data; these internal data sets are usually completed with external credit bureau data and then used to estimate PD in banks. Banks also have a continuing interest in using rule based classifiers to build their default prediction models. In practice, however, data records are usually incomplete and contain missing values. This creates problems for banks, especially in low default credit risk portfolios, and complicates rule based model building. Several strategies can be used to handle the missing data issue. This paper applies five missing value handling strategies to a real credit scoring dataset: ignoring, replacing with random values, replacing with the mean, replacing with C&R tree induced values, and elimination. Experimental results show that the ignoring strategy consistently outperforms the other methods on the test data set, and suggest that CHAID is a useful classifier for handling low default portfolios with missing values.

Keywords: Credit Scoring; Banking Industry; Rule extraction; Missing data; Low default portfolio.

Received: 29 March 2018
Revised: 14 June 2018
Accepted: 26 November 2018

1. Introduction

Credit scoring has been used in the banking industry for decades, and competitive pressure in the industry has made sophisticated scoring models one of the main assets and sources of competitiveness for banks. It is used to answer one key question: what is the probability of default within a fixed period, usually one year?
Data sets of previous loan applicants are built by gathering their data; these internal data sets are usually completed with external credit bureau data and then used to estimate PD in banks, using scoring software that has the model at its core (Van Gestel and Baesens, 2009). The classification techniques used in credit scoring problems include intelligent and statistical techniques. Logistic Regression (LR) is one of the most popular traditional

* Corresponding author; sadatrasoul@gmail.com
1 Management school, Kharazmi University, Tehran, Iran.
2 Shahid Beheshti University, Tehran, Iran.