GANgster: A Fraud Review Detector based on Regulated GAN with Data Augmentation

Saeedreza Shehnepoor*, Roberto Togneri, Wei Liu, Mohammed Bennamoun

Abstract—Financial implications of written reviews provide great incentives for businesses to pay fraudsters to write, or use bots to generate, fraud reviews. The promising performance of Deep Neural Networks (DNNs) in text classification has attracted research into using them for fraud review detection. However, the lack of trusted labeled data has limited the performance of current solutions in detecting fraud reviews. Unsupervised and semi-supervised methods are among the most applicable methods for dealing with the data scarcity problem. The Generative Adversarial Network (GAN), as a semi-supervised method, has been shown to be effective for data augmentation purposes. The state-of-the-art solution utilizes a GAN to overcome the data limitation problem. However, it fails to incorporate behavioral clues in both fraud generation and detection. Besides, the state-of-the-art approach suffers from a common limitation in the training convergence of the GAN, which slows down the training procedure. In this work, we propose a regularized GAN for fraud review detection that makes use of both review text and review rating scores. Scores are incorporated into the loss function through Information Gain Maximization for two reasons. One is to generate near-authentic, more human-like, score-correlated reviews. The other is to improve the stability of the GAN. Experimental results show better convergence of the regulated GAN. In addition, the scores are used in combination with word embeddings of the review text as input to the discriminators for better performance. Results show that the proposed framework outperforms the existing state-of-the-art framework, FakeGAN, in terms of AP by a relative 7% and 5% on the Yelp and TripAdvisor datasets, respectively.
Index Terms—fraud review detection, deep learning, generative adversarial networks, multi-attribute, Information Gain Maximization.

I. INTRODUCTION

Social media is full of users’ opinions about matters such as news, personal events, advertisements, and businesses. Opinions concerning businesses can greatly influence users’ decisions on purchasing certain products or services. A study in 2015 demonstrated that about 70 percent of people in the US read other users’ reviews of a product before purchasing it¹. The openness of popular review platforms (Amazon, eBay, TripAdvisor, Yelp, etc.) provides an opportunity for marketers to promote their own business or defame their competitors by deploying new techniques such as bots, or by hiring humans to write fraud reviews for them. The reviews produced in this way are called “Fraud Reviews” [1, 2, 3]. Studies show that fraud reviews on Yelp increased by 5% to 25% from 2005 to 2016 [4]. It is worth mentioning that fraudulent content with the same characteristics appears in other social media contexts [5]. For example, fake news consists of articles intentionally written to convey false information for a variety of purposes, such as financial or political manipulation [6, 7]. Studying these types of content on social media requires sufficient knowledge of political science, journalism, psychology, etc. [8, 9].

S. Shehnepoor (*corresponding author) is with the University of Western Australia, Perth, Australia.
R. Togneri is with the University of Western Australia, Perth, Australia.
M. Bennamoun is with the University of Western Australia, Perth, Australia.
W. Liu is with the University of Western Australia, Perth, Australia.
emails: {saeedreza.shehnepoor@research.uwa.edu.au, roberto.togneri@uwa.edu.au, wei.liu@uwa.edu.au, mohammed.bennamoun@uwa.edu.au}

¹ https://www.mintel.com/press-centre/social-and-lifestyle/seven-in-10-americans-seek-out-opinions-before-making-purchases
Since the first work on social fraud reviews by Jindal et al. in 2008 [10], many approaches have been used to address this problem. These include text-based features, which are extracted from the review text [11], such as language models [12], and behavioral features, which capture clues from users’ behavior patterns using metadata or user profiles [13]. These approaches can also be combined for better performance [12, 14]. Hand-crafted features are fed to classifiers such as the Multi-Layer Perceptron (MLP), Naive Bayes, or Support Vector Machines (SVMs) to predict whether a review is genuine or not. We call these approaches, which use hand-crafted features, “classic approaches”. In recent years, Deep Learning (DL) has been used for fraud review detection, modeling it as a “text classification” task, for better feature representation and to address the overfitting problem [15]. To deal with data scarcity, a recent attempt [16] adopted a GAN in a framework called “FakeGAN”. FakeGAN consists of a generator, which produces fake samples as auto-generated reviews, and two discriminators: one discriminates between fake and real samples, and the other discriminates between fraud human reviews and fraud generated ones. Despite FakeGAN’s simplicity and effectiveness, it suffers from major limitations. The first limitation is the lack of high-quality score-correlated data. Reviews generated by FakeGAN contain only text and provide no metadata such as the score, which has been shown to be more useful than review text for fraud detection [12, 14]. Generating high-quality data correlated with the score provides a better feature representation learned jointly from both text and metadata. Second, FakeGAN suffers from a lack of stability in the training step; in other words, the training procedure in FakeGAN takes time to stabilize. Regularizing the objective function is one way to ensure the convergence of a GAN.
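As a rough illustration of what regularizing a GAN objective can look like, the sketch below writes the standard GAN value function and adds a mutual-information term in the style of InfoGAN. This is the generic form from the literature, not necessarily the exact loss used in this work. Reading the latent code $c$ as the review score, its prior $p_c$, and the weight $\lambda$ as a tunable hyperparameter are assumptions made here for illustration.

```latex
% Standard GAN value function, with the generator conditioned on a code c
% (c, p_c and \lambda are illustrative assumptions, not taken from the text)
V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
        + \mathbb{E}_{z \sim p_z,\; c \sim p_c}\left[\log\left(1 - D(G(z, c))\right)\right]

% Information-regularized objective (InfoGAN-style), with \lambda \ge 0
\min_{G} \max_{D} \; V_I(D, G) = V(D, G) - \lambda \, I\left(c;\, G(z, c)\right)
```

Here $I(c; G(z, c))$ is the mutual information between the code and the generated sample. Maximizing it pushes the generator to produce samples that remain informative about $c$.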
Finally, the performance of FakeGAN was evaluated on only one dataset and has not been tested on others. Experiments on datasets from different domains are required to ensure the scalability of the proposed approach.

arXiv:2006.06561v1 [cs.LG] 11 Jun 2020

In this paper, we propose to use Generative Adversarial Networks (GANs) [17] in our framework to solve the data