Citation: Ayoub, S.; Gulzar, Y.; Rustamov, J.; Jabbari, A.; Reegu, F.A.; Turaev, S.
Adversarial Approaches to Tackle Imbalanced Data in Machine Learning.
Sustainability 2023, 15, 7097. https://doi.org/10.3390/su15097097
Academic Editor: Andreas Kanavos
Received: 20 February 2023
Revised: 1 April 2023
Accepted: 13 April 2023
Published: 24 April 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and
conditions of the Creative Commons Attribution (CC BY) license
(https://creativecommons.org/licenses/by/4.0/).
sustainability
Article
Adversarial Approaches to Tackle Imbalanced Data in Machine Learning
Shahnawaz Ayoub 1, Yonis Gulzar 2,*, Jaloliddin Rustamov 3, Abdoh Jabbari 4, Faheem Ahmad Reegu 4
and Sherzod Turaev 5,*
1 Department of Computer Science and Engineering, Shri Venkateshwara University, NH-24,
Venkateshwara Nagar, Gajraula 244236, Uttar Pradesh, India; shahnawazayoub@outlook.com
2 Department of Management Information Systems, College of Business Administration, King Faisal University,
Al-Ahsa 31982, Saudi Arabia
3 Health Data Science Lab, Department of Genetics and Genomics, College of Medicine and Health Sciences,
United Arab Emirates University, Al Ain 15551, United Arab Emirates
4 Department of Computer Science and Information Technology, Jazan University, Jazan 45142, Saudi Arabia
5 Department of Computer Science & Software Engineering, College of Information Technology,
United Arab Emirates University, Al Ain 15551, United Arab Emirates
* Correspondence: ygulzar@kfu.edu.sa (Y.G.); sherzod@uaeu.ac.ae (S.T.); Tel.: +966-545-719-118 (Y.G.)
Abstract: Real-world applications often involve imbalanced datasets, in which examples are
distributed unevenly across classes. When building a system that requires high accuracy, the
performance of the classifier is crucial. However, imbalanced datasets can lead to poor classification
performance, even with conventional techniques such as the synthetic minority oversampling
technique (SMOTE). This study therefore proposes balancing the datasets using adversarial learning
methods such as generative adversarial networks (GANs). The study evaluated the effect of data
augmentation on both balanced and imbalanced datasets. Classification performance was assessed
on three different datasets, with data augmentation used to generate synthetic data for the
minority class. Before augmentation, a decision tree was applied to each of the three datasets,
giving classification accuracies of 79.9%, 94.1%, and 72.6%. A decision tree was also used to
evaluate performance after data augmentation, and the proposed model achieved accuracies of
82.7%, 95.7%, and 76% on highly imbalanced datasets. This study demonstrates the potential of
data augmentation to improve classification performance on imbalanced datasets.
Keywords: computer vision; machine learning; deep learning; imbalanced dataset
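The evaluation protocol summarized in the abstract (train a classifier on the imbalanced data, augment the minority class with synthetic samples, then train and evaluate again) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are a synthetic one-feature toy set, the classifier is a depth-1 decision tree (a stump), and the generator is a jittered resampler used as a simple stand-in for a GAN. All function names are hypothetical, and the sketch does not reproduce the authors' model or datasets.

```python
import random
from collections import Counter


def make_imbalanced(n_major=500, n_minor=25, seed=0):
    """Toy 1-D dataset (assumption): majority class 0, minority class 1."""
    rng = random.Random(seed)
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_major)]
    data += [(rng.gauss(2.0, 1.0), 1) for _ in range(n_minor)]
    return data


def stratified_split(data, test_pct=30, seed=0):
    """Hold out test_pct percent of each class so both classes reach the test set."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted({y for _, y in data}):
        rows = [row for row in data if row[1] == label]
        rng.shuffle(rows)
        k = len(rows) * test_pct // 100
        test += rows[:k]
        train += rows[k:]
    return train, test


def oversample_minority(data, jitter=0.1, seed=0):
    """Stand-in generator (NOT a GAN): jittered copies of minority samples
    are added until every class matches the majority class count."""
    rng = random.Random(seed)
    counts = Counter(y for _, y in data)
    major_label, major_n = counts.most_common(1)[0]
    augmented = list(data)
    for label, n in counts.items():
        if label == major_label:
            continue
        pool = [x for x, y in data if y == label]
        for _ in range(major_n - n):
            augmented.append((rng.choice(pool) + rng.gauss(0.0, jitter), label))
    return augmented


def fit_stump(data):
    """Depth-1 decision tree: pick the threshold t maximizing training
    accuracy of the rule 'predict class 1 if x > t'."""
    best_t, best_acc = None, -1.0
    for t in sorted({x for x, _ in data}):
        acc = sum((x > t) == bool(y) for x, y in data) / len(data)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def evaluate(threshold, data):
    """Return (accuracy, minority-class recall) on the given data."""
    preds = [(int(x > threshold), y) for x, y in data]
    accuracy = sum(p == y for p, y in preds) / len(preds)
    minority = [p for p, y in preds if y == 1]
    recall = sum(minority) / len(minority) if minority else 0.0
    return accuracy, recall


if __name__ == "__main__":
    train, test = stratified_split(make_imbalanced())
    for name, train_set in [("imbalanced", train),
                            ("augmented", oversample_minority(train))]:
        acc, rec = evaluate(fit_stump(train_set), test)
        print(f"{name}: accuracy={acc:.3f}, minority recall={rec:.3f}")
```

The test split is left untouched by the augmentation, so both classifiers are compared on the same real, imbalanced examples. Minority recall is reported next to accuracy because, on a highly imbalanced test set, accuracy alone can stay high even when the minority class is mostly misclassified.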
1. Introduction
Any artificial intelligence (AI) application depends mainly on data [1]. Owing to its
numerous uses, AI has been incorporated into many areas, such as healthcare [2–5],
agriculture [6,7], multi-class image classification [8], image caption prediction [9], fake image
identification [10], and other purposes [11–13]. In the majority of real-world classification
applications, the training data follow a long-tailed distribution: a few classes are abundant,
whereas the other classes have limited examples [14,15]. Over the last several years, the
research community has been interested in learning from imbalanced data. Various researchers
have attempted to solve binary-class imbalance problems [16]. When multiple class labels are
present, the solutions proposed for binary-class problems may not be directly applicable or may
perform poorly, and most real-world problems are multi-class problems. Machine learning is a
well-known research field in computer science that employs several algorithms to extract useful
information from datasets. However, imbalanced data can lead to biased models [17], which may
have negative impacts on marginalized communities and the environment. For example, a biased