Integration of ML Techniques for Early Detection of Breast Cancer: Dimensionality
Reduction Approach
Wial Hanon 1*, Mahdi Abed Salman 2
1 Information Technology, Software Department, University of Babylon, Hilla 51001, Iraq
2 College of Science for Women, Department of Computer Science, University of Babylon, Hilla 51001, Iraq
Corresponding Author Email: wailh@uobabylon.edu.iq
Copyright: ©2024 The authors. This article is published by IIETA and is licensed under the CC BY 4.0 license
(http://creativecommons.org/licenses/by/4.0/).
https://doi.org/10.18280/isi.290134
Received: 19 October 2023
Revised: 16 January 2024
Accepted: 23 January 2024
Available online: 27 February 2024
ABSTRACT
The diagnosis of breast cancer (DBC) helps doctors detect the disease early and classify
tumors as non-cancerous (benign, B) or cancerous (malignant, M). Machine learning (ML)
algorithms are therefore a promising approach to diagnosing and predicting symptoms
related to DBC. However, increased computational complexity, large data size, overfitting,
and long training times harm the accuracy of early diagnosis. In this paper, we propose a
dimensionality reduction model that integrates PCA and KNN to diagnose and predict
breast cancer (DPBC). The model reduces the data size by selecting the features that
capture most of the variance in the data. Its performance is evaluated with metrics such
as accuracy, precision, and F-score. Results for the DPBC model were obtained on the
Breast Cancer Wisconsin (BCW) medical dataset.
Keywords:
Principal Component Analysis (PCA), K-nearest neighbor (KNN), integrate PCA and
KNN (DPBC), dimensionality reduction, diagnosis breast cancer (DBC), Breast Cancer
Wisconsin medical dataset (BCW)
1. INTRODUCTION
Globally, breast cancer, sometimes referred to as carcinoma,
is a leading cause of death among women. It first attacks the
breast tissue before spreading to nearby organs. If not
detected in its early stages, it can be fatal [1, 2]. Breast
tumors are classified as either benign or malignant, depending
on whether they are non-cancerous or cancerous. Differentiating
between the tissue of a benign and a malignant breast tumor is
difficult [3].
ML-based methods help oncologists make better medical
decisions and make treatment simpler and more affordable.
Neural networks have become an alternative to manually
choosing the best features [4]. In deep learning, features are
learned directly from data using several non-linear processing
layers [5], but the approach has drawbacks: it requires a large
amount of labeled data, is expensive and hard to interpret, is
prone to data bias, and takes a long time to train [6].
ML has been widely employed in computational processing
because of its proven ability to improve both performance and
prediction accuracy. The most well-known algorithms are
neural networks, decision trees, random forests, and support
vector machines (SVM). These models make predictions from
facts derived from experience, that is, from past data. To
determine the most precise relationship between variables,
they can be applied to a variety of tasks, such as early breast
cancer prediction, forecasting, and time-series analysis [7, 8].
By developing prediction models, it may be possible to
identify diseases earlier and provide patients with more
effective treatment. ML models have demonstrated significant
performance when used to diagnose breast cancer in earlier
research [9, 10].
ML has also been widely used in remote sensing because it can
accurately predict outputs from strongly correlated input
data. This opens up numerous options for biophysical parameter
retrieval and related applications [11, 12].
The use of Internet of Things (IoT) devices has become a
necessity in daily life, especially in health care [13]. In the
same context, the increasing volume of generated data needs to
be reduced to facilitate its transfer to applications and cloud
centers for processing and analysis [14].
PCA is an unsupervised learning technique: it does not need
labeled data, and it is a common dimensionality reduction
method because of its simplicity and ease of implementation.
Its primary premise is to transform the features into
components with reduced mutual correlation, sorted by
decreasing eigenvalue and therefore by decreasing variance.
The eigenvectors of the covariance matrix define the principal
components. Because the features have different ranges, they
are first standardized [15].
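
The PCA steps described above (standardization, eigen-decomposition, and sorting by decreasing eigenvalue) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the synthetic data, the helper names `pca_explained_variance` and `n_components_for`, and the 0.95 cumulative-variance threshold are all assumptions chosen for illustration, and the sketch assumes no feature has zero variance.

```python
import numpy as np

def pca_explained_variance(X):
    """Standardize X, then eigen-decompose its covariance matrix.

    Returns the standardized data, the eigenvalues in decreasing order,
    the matching eigenvectors (as columns), and the explained variance ratios.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each feature (zero mean, unit variance) so that features
    # with different ranges contribute comparably.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    cov = np.cov(Z, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # re-sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = eigvals / eigvals.sum()
    return Z, eigvals, eigvecs, ratios

def n_components_for(ratios, threshold=0.95):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ratios))

# Hypothetical synthetic data: 200 samples, 6 correlated features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixed = latent @ rng.normal(size=(2, 4)) + 0.05 * rng.normal(size=(200, 4))
X = np.hstack([latent, mixed])

Z, eigvals, eigvecs, ratios = pca_explained_variance(X)
k = n_components_for(ratios, threshold=0.95)  # threshold is an assumption
scores = Z @ eigvecs[:, :k]                   # data projected onto k components

print("explained variance ratios:", np.round(ratios, 4))
print("components kept:", k, "projected shape:", scores.shape)
```

The projected `scores` are mutually uncorrelated, and their variances equal the leading eigenvalues, which is the property the paragraph above refers to as reducing mutual correlation.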
In this paper, an integrated PCA and KNN algorithm for a
breast cancer prediction model is proposed. Dimensionality
reduction is used to enhance accuracy by selecting the best
features. The model is used to increase accuracy and speed in
detecting breast cancer using the smallest possible number of
features extracted from a CT or MRI scan image. Feature
selection is embedded by computing the explained variance
ratios and the cumulative variance to understand how much
variance each component explains, and then selecting the best
number of components
Ingénierie des Systèmes d’Information
Vol. 29, No. 1, February, 2024, pp. 347-353
Journal homepage: http://iieta.org/journals/isi