Mathematics and Cybernetics – applied aspects

Copyright © 2023, Authors. This is an open access article under the Creative Commons CC BY license

UNBALANCED CREDIT FRAUD MODELING BASED ON BAGGING AND BAYESIAN OPTIMIZATION

Mohammed A. Kashmoola, Master, Lecturer*
Samah Fakhri Aziz, Master, Lecturer*
Hasan Mudhafar Qays, Master, Lecturer*
Naors Y. Anad Alsaleem, Corresponding author, PhD, Assistant Professor*, E-mail: nawrasyounis@uohamdaniya.edu.iq
*Department of Computer Science, University of Al-Hamdaniya, Ninavah, 79CF+PV, Bartela, Hamdaniya, Iraq, 1528200

Credit fraud modeling is a crucial area of research that is highly relevant to the credit loan industry. Effective risk management is a key factor in providing quality credit services and directly impacts the profitability and bad debt ratio of leading organizations in this sector. However, when the distribution of credit fraud data is highly unbalanced, it can lead to noise errors caused by information distortion, periodic statistical errors, and model biases during training. This can cause unfair results for the minority class (target class) and increase the risk of overfitting. While traditional data balancing methods can reduce the bias of models towards the majority class in relatively unbalanced data, they may not be effective in highly unbalanced scenarios. To address this challenge, this paper proposes using Bagging algorithms such as Random Forest and Bagging to model highly unbalanced credit fraud data. Bayesian optimization is used to tune the hyperparameters, with the accuracy on the minority class serving as the objective function, and the model is tested on real European credit card fraud data. The results of the proposed bagging algorithms are compared with traditional data balancing methods such as Balanced Bagging and Balanced Random Forest.
The study found that traditional data balancing methods may not be compatible with excessively unbalanced data, whereas Bagging algorithms show promise as a solution for modeling such data. The proposed method for finding hyperparameters effectively deals with highly unbalanced data. It achieved precision, recall, and F1-score for the minority category of 0.94, 0.81, and 0.87, respectively. The study emphasizes the importance of addressing the challenges associated with unbalanced credit fraud data to improve the accuracy and fairness of credit fraud models.

Keywords: unbalanced data, Bayesian optimization, random forest, majority and minority class

UDC 621.391
DOI: 10.15587/1729-4061.2023.279936

How to Cite: Kashmoola, M. A., Aziz, S. F., Qays, H. M., Alsaleem, N. Y. A. (2023). Unbalanced credit fraud modeling based on bagging and bayesian optimization. Eastern-European Journal of Enterprise Technologies, 3 (4 (123)), 47–53. doi: https://doi.org/10.15587/1729-4061.2023.279936

Received date 09.03.2023
Accepted date 19.05.2023
Published date 30.06.2023

1. Introduction

The growth of the banking industry has been accompanied by increasing market volatility and credit fraud, and the widespread use of financial derivatives has further contributed to this trend. One of the critical challenges in credit fraud detection is to accurately rate an applicant’s creditworthiness based on the objective patterns present in the credit data. The rating is a binary classification problem: determining whether the applicant has committed credit fraud [1, 2]. The identification and early warning of credit fraud are crucial, and the success of the risk identification model is closely related to data balance. Thus, finding models that are unaffected by data imbalance, as occurs with credit fraud, is highly useful. Borrowers who appear reliable have a high success rate and exhibit a low default rate [3, 4].
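To illustrate why data balance matters for the binary classification problem described above, the following is a minimal sketch and not the authors' code. The transaction counts are hypothetical and chosen only for illustration. It shows that a classifier labeling every transaction as legitimate can reach very high overall accuracy while detecting no fraud, which is why minority-class precision, recall, and F1-score are the more informative measures. The last lines check that the precision (0.94) and recall (0.81) reported in the abstract are consistent with the reported F1-score (0.87).

```python
def minority_metrics(tp, fp, fn):
    """Precision, recall and F1-score for the positive (fraud) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(tp, fp, fn, tn):
    """Overall share of correctly classified samples."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical test set (assumption): 10,000 transactions, 20 fraudulent.
# A trivial model that predicts "legitimate" for every transaction:
trivial_acc = accuracy(tp=0, fp=0, fn=20, tn=9980)
trivial_p, trivial_r, trivial_f1 = minority_metrics(tp=0, fp=0, fn=20)

# Consistency check of the figures quoted in the abstract:
# F1 is the harmonic mean of precision and recall.
reported_f1 = 2 * 0.94 * 0.81 / (0.94 + 0.81)

print(f"trivial model: accuracy={trivial_acc:.3f}, fraud recall={trivial_r:.2f}")
print(f"F1 from reported precision and recall: {reported_f1:.2f}")
```

In this sketch the trivial model is almost always right overall yet has zero recall on the fraud class, so accuracy alone says little about the minority class.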
Current research focuses on overcoming the problem of model bias towards the majority category by comparing the effects of data balancing methods. The main data balancing methods fall into two groups: changing the data distribution and making improvements at the algorithm level. The former involves over-sampling algorithms such as SMOTE [5] and ADASYN [6], which generate new samples, combined with under-sampling algorithms [7–12]. However, overfitting of the classifier may occur when the imbalance ratio is too large, and poor classification results may occur on the test set if a large number of majority class samples are discarded. The latter approach involves cost-sensitive methods that assign different misclassification costs to different classes, with higher costs for minority class samples erroneously classified as the majority class [13, 14]. However, determining the cost factor for misclassification is challenging. Alternatively, ensemble learning, which combines multiple learners to obtain better learning effects than a single learner, can be used [15]. Boosting, a commonly used ensemble method, can model unbalanced data sets. However, its cost function is subjective and therefore difficult to define. Therefore, most scholars focus on ensemble classification algorithms based on data processing, which include combining oversampling with boosting [16] and SMOTEBoost [17, 18]. Despite the above methods, unbalanced data classification algorithms still have some drawbacks, especially when the imbalance ratio is large. Therefore, studies devoted to addressing the challenges associated with unbalanced credit fraud data are of scientific relevance to the modern credit loan industry.

2. Literature review and problem statement

The paper [19] proposes a feature fusion-based machine learning model for fraud detection, which showed promising results in identifying fraudulent transactions. However, there