Integration of ML Techniques for Early Detection of Breast Cancer: Dimensionality Reduction Approach

Wial Hanon 1*, Mahdi Abed Salman 2

1 Information Technology, Software Department, University of Babylon, Hilla 51001, Iraq
2 College of Science for Women, Department of Computer Science, University of Babylon, Hilla 51001, Iraq

Corresponding Author Email: wailh@uobabylon.edu.iq

Copyright: ©2024 The authors. This article is published by IIETA and is licensed under the CC BY 4.0 license (http://creativecommons.org/licenses/by/4.0/).

https://doi.org/10.18280/isi.290134

Received: 19 October 2023
Revised: 16 January 2024
Accepted: 23 January 2024
Available online: 27 February 2024

ABSTRACT

The diagnosis of breast cancer (DBC) helps doctors detect breast tumors early and classify them as non-cancerous (benign, B) or cancerous (malignant, M). Machine learning (ML) algorithms offer a way to diagnose and predict symptoms related to DBC. However, increased computational complexity, data size, overfitting, and longer training times harm early diagnosis accuracy. In this paper, we propose a dimensionality reduction model that integrates PCA and KNN for early breast cancer detection. The model is used to diagnose and predict breast cancer (DPBC) on a reduced data size, obtained by selecting the best features, which capture most of the variance in the data. The performance of the proposed model is evaluated with indices such as accuracy, precision, and F-score. Results for the DPBC model were obtained by using the Breast Cancer Wisconsin medical dataset (BCW).

Keywords: principal component analysis (PCA), K-nearest neighbor (KNN), integrated PCA and KNN (DPBC), dimensionality reduction, diagnosis of breast cancer (DBC), Breast Cancer Wisconsin medical dataset (BCW)

1. INTRODUCTION

Globally, breast cancer, sometimes referred to as breast carcinoma, is a leading cause of cancer death among women. Before affecting any nearby organs, it initially attacks the tissue in the breasts.
If not detected in its early stages, it may turn deadly [1, 2]. Breast tumors are classified as either benign or malignant, depending on whether they are non-cancerous or cancerous. Differentiating between benign and malignant breast tumor tissue is difficult [3]. ML-based methods help oncologists make better medical decisions and make treating the condition simpler and more affordable. Neural networks have become an alternative means of choosing the best features [4]. In deep learning, features are learned directly from data using several non-linear processing layers [5], but this approach has drawbacks: it requires a large amount of labeled data, is expensive and hard to interpret, is susceptible to data bias, and needs long training times [6].

ML has been widely employed in computational processing because of its proven ability to improve both performance and prediction accuracy. The most well-known algorithms are neural networks, decision trees, random forests, and support vector machines (SVM). Such models base their predictions on facts derived from experience. They can be used to determine the most precise relationship between variables in a variety of applications, such as early breast cancer prediction, forecasting tasks, and time-series analysis [7, 8]. By developing prediction models, it may be possible to identify diseases earlier and provide patients with more effective treatment. In earlier research, ML models have demonstrated strong performance when used to diagnose breast cancer [9, 10]. ML has also been widely used in remote sensing because it can accurately model input-output relationships with strong correlations, which opens up numerous options for biophysical parameter retrieval and related applications [11, 12].

The use of Internet of Things (IoT) devices has become a necessity in daily life, especially in health care [13].
In the same context, the increasing volume of generated data needs to be reduced to facilitate its transfer to applications and cloud centers for processing and analysis [14].

PCA is an unsupervised learning technique: it does not need labeled data, and it is a common dimensionality reduction method because of its simplicity and ease of implementation. Its primary premise is to separate the features into groups so as to reduce their mutual correlation, and to sort the resulting components by decreasing eigenvalue and hence by decreasing variance. The eigenvectors are also called principal components. Because the features have different ranges, the data are first standardized [15].

In this paper, an integrated PCA and KNN algorithm for the breast cancer prediction model is proposed. Dimensionality reduction is used to enhance accuracy by selecting the best features. The model is used to increase accuracy and speed in detecting breast cancer using the smallest possible number of features extracted from a CT scan or MRI scan image. Feature selection is embedded by computing the explained variance ratios and cumulative variance to understand how much variance each component explains, and then selecting the best number of components

Ingénierie des Systèmes d’Information, Vol. 29, No. 1, February, 2024, pp. 347-353. Journal homepage: http://iieta.org/journals/isi
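
The selection step described in the introduction (explained variance ratios, cumulative variance, then a choice of the number of components) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the example eigenvalues, and the 0.85 cumulative-variance threshold are all assumptions made for illustration; the eigenvalues are assumed to come from the covariance matrix of the standardized features.

```python
from itertools import accumulate


def explained_variance_ratios(eigenvalues):
    """Return each eigenvalue's share of the total variance, largest first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    return [value / total for value in ordered]


def choose_n_components(eigenvalues, threshold=0.85):
    """Return the smallest number of leading components whose cumulative
    explained variance ratio reaches the threshold (an assumed cut-off)."""
    cumulative = list(accumulate(explained_variance_ratios(eigenvalues)))
    for k, covered in enumerate(cumulative, start=1):
        if covered >= threshold:
            return k
    return len(cumulative)  # rounding shortfall: keep every component


# Hypothetical eigenvalues, chosen only for illustration (not taken from BCW).
example_eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
ratios = explained_variance_ratios(example_eigenvalues)
cumulative = list(accumulate(ratios))
n_components = choose_n_components(example_eigenvalues, threshold=0.85)

print(ratios)        # per-component share of the variance
print(cumulative)    # running total of those shares
print(n_components)  # number of components kept at the 0.85 threshold
```

In this toy case the first three components already cover 87.5% of the variance, so three components are kept. A library such as scikit-learn exposes the same ratios through the `explained_variance_ratio_` attribute of a fitted `PCA` object, and the reduced data would then be passed to a KNN classifier.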