Mathematics and Cybernetics – applied aspects
Copyright © 2023, Authors. This is an open access article under the Creative Commons CC BY license
UNBALANCED CREDIT FRAUD MODELING BASED ON BAGGING AND BAYESIAN OPTIMIZATION
Mohammed A. Kashmoola
Master, Lecturer*
Samah Fakhri Aziz
Master, Lecturer*
Hasan Mudhafar Qays
Master, Lecturer*
Naors Y. Anad Alsaleem
Corresponding author
PhD, Assistant Professor*
E-mail: nawrasyounis@uohamdaniya.edu.iq
*Department of Computer Science
University of Al-Hamdaniya
Ninavah, 79CF+PV, Bartela,
Hamdaniya, Iraq, 1528200
Credit fraud modeling is a crucial area of research that is high-
ly relevant to the credit loan industry. Effective risk management is
a key factor in providing quality credit services and directly
impacts the profitability and bad debt ratio of leading organizations
in this sector. However, when the distribution of credit fraud data
is highly unbalanced, it can lead to noise errors caused by informa-
tion distortion, periodic statistical errors, and model biases during
training. This can cause unfair results for the minority class (target
class) and increase the risk of overfitting. While traditional data
balancing methods can reduce bias in models towards the majori-
ty class in relatively unbalanced data, they may not be effective in
highly unbalanced scenarios. To address this challenge, this paper
proposes using bagging-based ensemble algorithms, namely Random Forest
and Bagging, to model highly unbalanced credit fraud data. Bayesian
optimization is used to find the hyperparameters, with the accuracy
of the minority class serving as the objective function. The
model is tested on real European credit card fraud data.
The results of the proposed bagging algorithms are compared with
traditional data balancing methods such as Balanced Bagging
and Balanced Random Forest. The study found that traditional
data balancing methods may not be compatible with excessive-
ly unbalanced data, whereas Bagging algorithms show promise as
a solution for modeling such data. The proposed method for find-
ing hyperparameters effectively deals with highly unbalanced data.
It achieved a precision, recall, and F1-score for the minority
class of 0.94, 0.81, and 0.87, respectively. The study emphasizes
the importance of addressing the challenges associated with unba-
lanced credit fraud data to improve the accuracy and fairness of
credit fraud models.
Keywords: unbalanced data, Bayesian optimization, random
forest, majority and minority class
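The minority class scores reported in the abstract can be checked for internal consistency, since the F1-score is the harmonic mean of precision and recall. The short Python sketch below is only an illustration: the function name `f1_from_precision_recall` is hypothetical and is not taken from the paper.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the F1-score, the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Minority class precision and recall as reported in the abstract.
reported_precision = 0.94
reported_recall = 0.81

f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(round(f1, 2))  # 0.87, matching the reported F1-score
```
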
UDC 621.391
DOI: 10.15587/1729-4061.2023.279936
How to Cite: Kashmoola, M. A., Aziz, S. F., Qays, H. M., Alsaleem, N. Y. A. (2023). Unbalanced credit fraud modeling
based on bagging and bayesian optimization. Eastern-European Journal of Enterprise Technologies, 3 (4 (123)), 47–53.
doi: https://doi.org/10.15587/1729-4061.2023.279936
Received date 09.03.2023
Accepted date 19.05.2023
Published date 30.06.2023
1. Introduction
The banking industry’s growth has led to increasing
market volatility and credit fraud, and the widespread use
of financial derivatives has further contributed to this trend.
One of the critical challenges in credit fraud detection is to
accurately rate an applicant’s creditworthiness based on the
objective patterns present in the credit data. The rating is a binary
classification problem determining whether the applicant has
committed credit fraud [1, 2]. The identification and early
warning of credit fraud are crucial, and the success of the risk
identification model is closely related to data balance. Thus,
models that are robust to the data imbalance typical of credit
fraud are highly useful. Borrowers who appear reliable have
a high success rate and exhibit a low default rate [3, 4]. Cur-
rent research focuses on overcoming the problem of model
bias towards the majority category by comparing the effects
of data balancing methods. The main data balancing methods
work either at the data level, by changing the data distribution,
or at the algorithm level. The former involves over-sampling algorithms,
such as SMOTE [5] and ADASYN [6], which generate new minority class
samples, and their combination with under-sampling algorithms [7–12].
However, overfitting of the classifier may occur when the
imbalance ratio is too large, and poor classification results may
occur in the test set if a large number of majority class samples
are discarded. The latter approach involves cost-sensitive
methods that assign different misclassification costs to different
classes, with higher costs for minority class samples
erroneously classified as the majority class [13, 14]. However,
determining the cost factor for misclassification is challenging.
Alternatively, ensemble learning, which combines
multiple learners to obtain better learning effects than a single
learner, can be used [15]. Boosting, a commonly used ensemble
method, can model unbalanced data sets. However, its cost function
is subjective and hence difficult to define. Therefore,
most scholars focus on ensemble classification algorithms
based on data processing, which include combinations of oversampling
and boosting [16] and SMOTEBoost [17, 18]. Despite
the above methods, unbalanced data classification algorithms
still have some drawbacks, especially when the imbalance ra-
tio is large. Therefore, studies that are devoted to addressing
the challenges associated with unbalanced credit fraud data
are of scientific relevance in the modern credit loan industry.
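The model bias towards the majority class discussed above can be illustrated with a minimal Python sketch. The label counts (9,980 legitimate and 20 fraudulent samples) are hypothetical and are not taken from the paper's dataset, and the helper `class_report` is an illustrative name. The sketch shows that a classifier which always predicts the majority class reaches a high overall accuracy while detecting no fraud at all, which is why minority class precision, recall, and F1-score are more informative than accuracy on such data.

```python
def class_report(y_true, y_pred, positive=1):
    """Compute accuracy plus precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical, highly unbalanced labels: 0 = legitimate, 1 = fraud.
y_true = [0] * 9980 + [1] * 20

# A degenerate model that always predicts the majority class.
y_majority = [0] * len(y_true)

report = class_report(y_true, y_majority)
print(report)
```

Here the overall accuracy is 0.998 even though the recall for the fraud class is zero, so accuracy alone would hide the complete failure on the minority class.
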
2. Literature review and problem statement
The paper [19] proposes a feature fusion-based machine
learning model for fraud detection, which showed promising
results in identifying fraudulent transactions. However, there