Bagging k-dependence probabilistic networks: An alternative powerful fraud detection tool

Francisco Louzada a,⇑, Anderson Ara b

a Universidade de São Paulo, Instituto de Matemática e Ciências da Computação, São Carlos, SP, Brazil
b Universidade Federal de São Carlos, Departamento de Estatística, São Carlos, SP, Brazil

Keywords: Fraud detection; Probabilistic networks; Bayesian networks; Classification models; Bagging; Predictive performance

Abstract

Fraud is a global problem that has demanded more attention due to the accentuated expansion of modern technology and communication. When statistical techniques are used to detect fraud, a critical factor is whether the fraud detection model is accurate enough to correctly classify a case as fraudulent or legitimate. In this context, the concept of bootstrap aggregating (bagging) arises. The basic idea is to generate multiple classifiers by fitting the model to several replicated datasets, obtaining the predicted values from each fitted model, and then combining them into a single predictive classification in order to improve classification accuracy. In this paper, for the first time, we present a pioneering study of the performance of discrete and continuous k-dependence probabilistic networks within the context of bagging predictors for classification. Via a large simulation study and various real datasets, we found that probabilistic networks are a strong modeling option, with high predictive capacity and a large improvement under the bagging procedure when compared to traditional techniques.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Fraud rates in various areas, such as financial, commercial, technological, internal accounting and others, have been growing in an accentuated manner with the expansion of modern technology and global communication (Bolton & Hand, 2002; Kou, Lu, Sirwongwattana, & Huang, 2004).
According to the ESB (2011), there have been significant financial losses due to fraud in online business recently, which increased from US$5.2 billion in 2008 to US$8.6 billion in 2009. An effective methodology for fraud detection may help companies to offer their consumers a safe and reliable online environment, which encourages loyalty to their services. Therefore, it is essential that prevention technologies and fraud detection methods are developed and updated continuously, so that such measures cannot easily be circumvented. In this sense, fraud detection involves identifying fraud cases as quickly as possible once the fraud has been perpetrated (Bolton & Hand, 2002).

There are statistical methods in the areas of Knowledge Discovery in Databases (KDD), Data Mining and Machine Learning with applicable and successful solutions in different areas of fraud crimes. These methods use a database of cases labeled as fraudulent or legitimate to build a model that yields a score, usually called a suspicion score, to predict new cases of fraud. Among these methods, we mention the traditional statistical classification methods, such as logistic regression, probit regression and discriminant analysis, and more powerful tools, such as neural nets and rule-based algorithms (Bolton & Hand, 2002; Hand, 1981; Ngai, Y Hu, Chen, & Sun, 2011; Ripley, 1996).

Some papers have addressed the theory and the application of these tools. Wilson (2009) and Maranzato, Pereira, Naubert, and Lago (2010) used logistic regression to discriminate fraudulent actions from legitimate actions for insurance companies and e-commerce. Field and Hobson (1997) present a neural-network-based fraud management technique that relies on profiling. Fawcett and Provost (1997) present a rule-based tool for fraud detection using a series of machine learning methods.
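The bagging idea summarized in the abstract (fit a classifier to several bootstrap replicates of the data, then combine the predictions into a single classification) can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the authors' implementation: it assumes discrete features, uses a naive Bayes classifier with Laplace smoothing as the base learner (a probabilistic network with no dependence between features given the class), and combines predictions by majority vote. All function names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels):
    """Count class and per-class feature-value frequencies (discrete features)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    levels = defaultdict(set)            # feature index -> values seen in training
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            levels[j].add(value)
    return class_counts, value_counts, levels


def predict_naive_bayes(model, row):
    """Return the class with the highest Laplace-smoothed log posterior score."""
    class_counts, value_counts, levels = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        for j, value in enumerate(row):
            num = value_counts[(j, label)][value] + 1
            den = class_counts[label] + len(levels[j])
            score += math.log(num / den)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def bagging_fit(rows, labels, n_models=25, seed=0):
    """Fit one base classifier per bootstrap replicate of the training data."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        # Bootstrap replicate: n indices drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_naive_bayes([rows[i] for i in idx],
                                      [labels[i] for i in idx]))
    return models


def bagging_predict(models, row):
    """Combine the base classifiers' predictions by majority vote."""
    votes = Counter(predict_naive_bayes(m, row) for m in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy data: (amount level, foreign transaction) -> label.
    rows = [("high", "yes")] * 10 + [("low", "no")] * 10
    labels = ["fraud"] * 10 + ["legitimate"] * 10
    models = bagging_fit(rows, labels)
    print(bagging_predict(models, ("high", "yes")))
```

Majority voting is one common way to aggregate; averaging the predicted class probabilities across replicates is another, and which one suits a given fraud dataset is a modeling choice rather than something fixed by the sketch.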
⇑ Corresponding author. E-mail address: louzada@icmc.usp.br (F. Louzada).
Expert Systems with Applications 39 (2012) 11583–11592. http://dx.doi.org/10.1016/j.eswa.2012.04.024

Alternatively, the method of probabilistic networks, introduced by Pearl (1988) and disseminated in the literature under the name of Bayesian networks, also known as causal networks, belief networks or probabilistic dependence graphs, emerged in the 1980s and has been applied in a wide variety of real-world activities (Bobbio, Portinale, Minichino, & Ciancarmela, 2001). The method is based on conditional probability distributions between variables and their causal relationships. Geiger and Heckerman (1994), Sahami (1996), Cheng, Bell, and Liu (1997) and Friedman, Geiger, and Goldszmidt (1997) suggest classification models based on probabilistic network structures, such as naive Bayes networks, tree augmented networks and