Credit scoring using the clustered support vector machine

Terry Harris ⇑
Credit Research Unit, Department of Management Studies, The University of the West Indies, Cave Hill Campus, P.O. Box 64, Barbados

Article info
Article history: Available online 6 September 2014
Keywords: Credit risk; Credit scoring; Clustered support vector machine; Support vector machine

Abstract
This work investigates the practice of credit scoring and introduces the use of the clustered support vector machine (CSVM) for credit scorecard development. This recently designed algorithm addresses some of the limitations noted in the literature that are associated with traditional nonlinear support vector machine (SVM) based methods for classification. Specifically, it is well known that as historical credit scoring datasets get large, these nonlinear approaches, while highly accurate, become computationally expensive. Accordingly, this study compares the CSVM with other nonlinear SVM based techniques and shows that the CSVM can achieve comparable levels of classification performance while remaining relatively cheap computationally.
© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

In recent years, credit risk assessment has attracted significant attention from managers at financial institutions around the world. This increased interest has been caused in no small part by the weaknesses of existing risk management techniques that were revealed by the recent financial crisis, and by the growing demand for consumer credit (Wang, Yan, & Zhang, 2011). In response to these concerns, credit scoring has become increasingly important over the past decades as financial institutions move away from traditional manual approaches to this more advanced method, which entails the building of complex statistical models (Huang, Chen, & Wang, 2007; Zhou, Lai, & Yu, 2010).
Many of the statistical methods used to build credit scorecards are based on traditional classification techniques such as logistic regression or discriminant analysis. However, in recent times non-linear approaches,¹ such as the kernel support vector machine, have been applied to credit scoring. These methods have helped to increase the accuracy and reliability of many credit scorecards (Bellotti & Crook, 2009; Yu, 2008). Nevertheless, despite these advances, credit analysts at financial institutions are pressed to continually pursue improvements in classifier performance in an attempt to mitigate the credit risk faced by their institutions. However, many of the improvements in classifier performance remain unreported due to the proprietary nature of industry-led credit scoring research, which attempts to find more efficient and effective algorithms.

In the wider research community, the recent vintages of non-linear classifiers (e.g. the kernel support vector machine) have received a lot of attention and have been critiqued for, inter alia, their large time complexities. In fact, the best-known time complexity for training a kernel-based support vector machine is still quadratic (Bordes, Ertekin, Weston, & Bottou, 2005). As a result, when these classifiers are applied to credit scoring, substantial computational resources are consumed when training on reasonably sized real-world datasets. Accordingly, efforts to develop and apply new classifiers to credit scoring, which are capable of separating nonlinear data while remaining relatively inexpensive computationally, are well placed.

This paper investigates the suitability for credit scoring of a recently developed support vector machine based algorithm that has been proposed by Gu and Han (2013). Their clustered support vector machine has been shown to offer comparable performance to kernel-based approaches while remaining cheap in terms of computational time.
Furthermore, this study makes some novel adjustments to their implementation and explores the use of radial basis function (RBF) kernels in addition to the linear kernel posited by Gu and Han.

The remainder of this paper is presented as follows. Section 2 outlines a brief review of the literature concerning the field of credit scoring and sets the stage for the proposed CSVM model for credit scoring that is presented in Section 3. The details of the historic clients' loan dataset and modeling method are highlighted in Section 4. Section 5 presents the study results, and Section 6 discusses the findings, presents conclusions, and outlines possible directions for future research.

¹ This has been applied because credit-scoring data is often not linearly separable.

⇑ Tel.: +1 (246) 417 4302; fax: +1 (246) 438 9167. E-mail address: terry.harris@cavehill.uwi.edu

Expert Systems with Applications 42 (2015) 741–750
http://dx.doi.org/10.1016/j.eswa.2014.08.029
0957-4174/© 2014 Elsevier Ltd. All rights reserved.