Semi-supervised GANs for Fraud Detection *

Charitos Charitou
Department of Computer Science
City, University of London
London, UK
charitos.charitou@city.ac.uk

Artur d’Avila Garcez
Department of Computer Science
City, University of London
London, UK
a.garcez@city.ac.uk

Simo Dragicevic
BetBuddy
Playtech Plc
London, UK
simo.dragicevic@playtech.com

Abstract—Over the years the online gambling industry has evolved into one of the most profitable industries on the Internet. At the same time, new stringent regulations have required the online industry to become a lot more vigilant. Although standards have improved, the methods used to process finance from illicit activities have also evolved and become more sophisticated. Detecting these fraudulent activities in real life with high accuracy requires a learning system to be trained with balanced data sets of fraudulent and normal transactions. However, in the real world, the number of fraudulent cases is significantly lower than the number of normal cases. In this paper, to deal with data imbalance, we propose a novel generative adversarial framework based on semi-supervised learning of sparse auto-encoders for the detection of fraud in online gambling. Experimental results show that the proposed framework outperforms mainstream discriminative techniques such as logistic regression, random forest and multi-layer perceptron. We further validate the approach by applying it to other domains that suffer from class imbalance, obtaining promising results.

Index Terms—Fraud detection, Imbalanced data, Semi-supervised Generative Adversarial Networks, Sparse Auto-encoders.

I. INTRODUCTION

Fraud detection refers to the identification of illegal activities occurring in numerous industries such as finance, gambling, insurance or cybersecurity. If fraudulent behaviour is not monitored and prevented, it can have catastrophic consequences such as the financing of terrorism.
Many organizations have been interested in the immediate detection of illicit activities, aiming to prevent losses while also ensuring the safety of their customers [1].

This research is part of a collaboration with a major gambling operator. The purpose of the research is to explore the use of deep learning to strengthen processes used in the detection of suspicious gambling behaviour, in particular money laundering. In the UK, gambling firms have paid over £40 million in fines and settlements since 2017, with all major cases involving failings in detecting money laundering.

Until recently, the gambling industry has tackled the identification of money laundering in online gambling primarily by using knowledge-based systems. Whilst capable of easily embedding regulatory requirements, which have focused on simple thresholds, these systems are unable to adapt to new requirements to proactively monitor the activity of millions of online customers, or to changing malicious behaviour related to criminal activity online.

In fraud detection problems, the fraudulent cases tend to be far fewer than the non-fraudulent ones (referred to in the literature as an ‘imbalanced data set’), which leads to difficulties in the training of classification algorithms. In most cases, such algorithms seek to maximize accuracy and as a result become biased towards the majority class.

Classification models such as logistic regression (LR), random forest (RF) and multi-layer perceptron (MLP) are typically discriminative models, i.e. given a certain feature set, they try to select the most appropriate class. This is, essentially, the root cause of the bias caused by data imbalance: the algorithm has no notion of ‘how’ the data are produced and instead focuses on an objective measure of discrimination (e.g. accuracy).
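The effect of optimizing accuracy on imbalanced data can be illustrated with a small worked example. The sketch below is not taken from the paper's experiments: the class ratio (10 fraudulent cases in 1,000) and the helper names `majority_class_predictor`, `accuracy` and `recall` are assumptions chosen purely for illustration. It shows that a trivial predictor which never flags fraud still scores very high accuracy while detecting no fraudulent case at all.

```python
# Hypothetical toy data (an assumption for illustration only):
# 1,000 transactions, of which 10 are fraudulent (label 1).
labels = [1] * 10 + [0] * 990


def majority_class_predictor(n):
    """Predict 'normal' (0) for every transaction, ignoring all features."""
    return [0] * n


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive cases that were predicted positive."""
    true_pos = sum(t == positive and p == positive
                   for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


predictions = majority_class_predictor(len(labels))
acc = accuracy(labels, predictions)
fraud_recall = recall(labels, predictions)

# Prints: accuracy = 0.990, fraud recall = 0.000
print(f"accuracy = {acc:.3f}, fraud recall = {fraud_recall:.3f}")
```

Under these assumed numbers, accuracy alone would rank this useless predictor highly, which is why a class-sensitive measure such as recall on the fraud class is usually reported alongside it.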
A way of alleviating this problem is to use models that also aim to understand the underlying generative process, as done for example by generative networks. Gaussian Mixture Models (GMMs) have formed the backbone of a variety of generative models, including Hidden Markov Models, employed with this objective [2], yet they come with Gaussian distribution assumptions and require much effort to be deployed in classification problems. Such models have been used together with clustering techniques to provide the required classification algorithm [3]. Recently, Generative Adversarial Networks (GANs) have allowed for a more generic approach, with the advantage of combining generative and discriminative techniques end-to-end. By extending the traditional framework of GANs to allow the discriminator to perform classification [4], semi-supervised GANs (SSGANs) have shown potential in the recent literature, particularly for learning from unstructured data such as images or sound [5]. Nevertheless, research regarding the application of GANs to structured data has been very limited.

978-1-7281-6926-2/20/$31.00 ©2020 IEEE

In this paper, we argue that semi-supervised GANs can provide a powerful and versatile framework for tackling supervised learning from imbalanced and sparse structured data. We validate this claim empirically by applying SSGANs to different domains suffering from the same data imbalance difficulty. We conduct experiments on the benchmark data sets for Credit Card Fraud, Breast Cancer Wisconsin and Pima Diabetes. Finally, we apply the proposed semi-supervised framework to a real-world Gambling Fraud Detection data set related to money laundering. We compare