Semi-supervised GANs for Fraud Detection
Charitos Charitou
Department of Computer Science
City, University of London
London, UK
charitos.charitou@city.ac.uk
Artur d’Avila Garcez
Department of Computer Science
City, University of London
London, UK
a.garcez@city.ac.uk
Simo Dragicevic
BetBuddy
Playtech Plc
London, UK
simo.dragicevic@playtech.com
Abstract—Over the years the online gambling industry has
evolved into one of the most profitable industries on the Internet.
At the same time, new stringent regulations have required the
online industry to become a lot more vigilant. Although standards
have improved, the methods used to process finance from illicit
activities have also evolved and become more sophisticated. Detecting
these fraudulent activities in real life with high accuracy requires
a learning system to be trained with balanced data sets of
fraudulent and normal transactions. However, in the real world,
the number of fraudulent cases is significantly lower than
normal cases. In this paper, to deal with data imbalance, we
propose a novel generative adversarial framework based on semi-
supervised learning of sparse auto-encoders for the detection
of fraud in online gambling. Experimental results show that
the proposed framework outperforms mainstream discriminative
techniques such as logistic regression, random forest and
multi-layer perceptron. We further validate the approach by
applying it to other domains that suffer from the problem of
class imbalance, obtaining promising results.
Index Terms—Fraud detection, Imbalanced data, Semi-
supervised Generative Adversarial Networks, Sparse Auto-
encoders.
I. INTRODUCTION
Fraud detection refers to the identification of illegal
activities occurring in numerous industries such as finance,
gambling, insurance or cybersecurity. If fraudulent behaviour
is not monitored and prevented then it can have catastrophic
consequences such as the financing of terrorism. Many
organizations have been interested in the immediate detection of
illicit activities, aiming to prevent losses, while also ensuring
the safety of their customers [1].
This research is part of a collaboration with a major
gambling operator. The purpose of the research is to explore
the use of deep learning to strengthen processes used in
the detection of suspicious gambling behaviour, in particular
money laundering. In the UK, gambling firms have paid over
£40 million in fines and settlements since 2017 with all major
cases involving failings in detecting money laundering.
Until recently, the gambling industry tackled the
identification of money laundering in online gambling primarily
by using knowledge-based systems. Whilst capable of easily
embedding regulatory requirements, which have focused on
simple thresholds, these systems are unable to adapt to new
requirements to proactively monitor the activity of millions of
online customers, or to the changing malicious behaviour related
to criminal activity online.
In fraud detection problems, the fraudulent cases tend to
be far fewer than the non-fraudulent ones (referred to in
the literature as an ‘imbalanced data set’), which leads to
difficulties in the training of classification algorithms. In most
cases, such algorithms seek to maximize accuracy and as a
result become biased towards the majority class.
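
To make this accuracy bias concrete, the following minimal Python sketch uses hypothetical counts (990 normal and 10 fraudulent transactions, not taken from any data set in this paper) and a trivial classifier that always predicts the majority class. The function names are illustrative assumptions.

```python
# Illustrative sketch: accuracy vs. recall on an imbalanced data set.
# The class counts below are hypothetical, chosen only for illustration.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of true positive-class cases that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred)
                   if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0

n_normal, n_fraud = 990, 10               # assumed imbalance (1% fraud)
y_true = [0] * n_normal + [1] * n_fraud   # 0 = normal, 1 = fraud

# A degenerate classifier that always predicts the majority class.
y_pred_majority = [0] * len(y_true)

acc = accuracy(y_true, y_pred_majority)
rec = recall(y_true, y_pred_majority)
print(f"accuracy = {acc:.3f}, fraud recall = {rec:.3f}")
```

Under these assumed counts, the trivial classifier scores high on accuracy while detecting no fraudulent case at all, which is why an accuracy-maximizing learner can drift towards the majority class.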
Classification models such as logistic regression (LR),
random forest (RF) and multi-layer perceptron (MLP) are typically
discriminative models, i.e. given a certain feature set,
they try to select the most appropriate class. This is, essentially,
the root cause of the bias caused by data imbalance: the
algorithm has no notion of ‘how’ the data are produced and
focuses only on the objective measure of discrimination
(e.g. accuracy). One way of alleviating this problem is to use
models that also aim to capture the underlying generative
process, as done for example by generative
networks. Gaussian Mixture Models (GMMs) have formed
the backbone of a variety of generative models, including
Hidden Markov Models, employed with this objective [2], yet
they come with Gaussian distribution assumptions and require
much effort to be deployed in classification problems. Such
models have been used together with clustering techniques to
provide the required classification algorithm [3].
Recently, Generative Adversarial Networks (GANs) have allowed
for a more generic approach, with the advantage of combining
generative and discriminative techniques end-to-end. By
extending the traditional framework of GANs to allow
the discriminator to perform classification [4], semi-supervised
GANs (SSGANs) have shown potential in the recent literature,
particularly at learning from unstructured data such as images
or sound [5]. Nevertheless, research on the application
of GANs to structured data has been very limited.
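
One common way in the SSGAN literature to let the discriminator perform classification is to give it K+1 outputs: K real classes plus one extra ‘generated’ class. The plain-Python sketch below illustrates the resulting loss terms under the assumption K = 2 (normal, fraud). The class indices, function names and example logits are hypothetical, and this is not necessarily the exact formulation used in [4] or in the framework proposed in this paper.

```python
import math

# Assumed class layout (hypothetical): 0 = normal, 1 = fraud, 2 = generated.
FAKE = 2

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def supervised_loss(logits, label):
    """Cross-entropy for a labelled real sample (label is 0 or 1)."""
    return -math.log(softmax(logits)[label])

def unsupervised_real_loss(logits):
    """Unlabelled real sample: penalise mass on the 'generated' class."""
    return -math.log(1.0 - softmax(logits)[FAKE])

def unsupervised_fake_loss(logits):
    """Generated sample: reward mass on the 'generated' class."""
    return -math.log(softmax(logits)[FAKE])

def predict_class(logits):
    """At test time, classify among the real classes only."""
    real = logits[:FAKE]
    return max(range(len(real)), key=lambda k: real[k])

# Example discriminator outputs (made-up logits for illustration).
labelled_logits = [0.2, 1.5, -1.0]    # labelled as fraud (class 1)
generated_logits = [-0.5, -0.3, 2.0]  # produced by the generator

total_loss = (supervised_loss(labelled_logits, 1)
              + unsupervised_real_loss(labelled_logits)
              + unsupervised_fake_loss(generated_logits))
print("discriminator loss:", total_loss)
print("predicted class:", predict_class(labelled_logits))
```

In this kind of setup the unsupervised terms let unlabelled and generated samples shape the discriminator, while the supervised term uses the few labelled cases; a practical implementation would compute these terms over mini-batches in a deep learning library.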
In this paper, we argue that semi-supervised GANs can
provide a powerful and versatile framework for tackling
supervised learning from imbalanced and sparse structured data.
We validate this claim empirically by applying SSGANs to
different domains suffering from the same data imbalance
difficulty. We conduct experiments on the benchmark data sets
for Credit Card Fraud, Breast Cancer Wisconsin and Pima
Diabetes. Finally, we apply the proposed semi-supervised
framework to a real-world Gambling Fraud Detection data
set related to money laundering. We compare
978-1-7281-6926-2/20/$31.00 ©2020 IEEE