Electronics
Article

Handling Missing Values Based on Similarity Classifiers and Fuzzy Entropy Measures

Faten Khalid Karim 1,*, Hela Elmannai 2, Abdelrahman Seleem 3, Safwat Hamad 4 and Samih M. Mostafa 3

1 Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
2 Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
3 Computer Science Department, Faculty of Computers and Information, South Valley University, Qena 83523, Egypt
4 Scientific Computing Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo 11566, Egypt
* Correspondence: fkdiaaldin@pnu.edu.sa

Citation: Karim, F.K.; Elmannai, H.; Seleem, A.; Hamad, S.; Mostafa, S.M. Handling Missing Values Based on Similarity Classifiers and Fuzzy Entropy Measures. Electronics 2022, 11, 3929. https://doi.org/10.3390/electronics11233929
Academic Editor: Andrei Kelarev
Received: 19 October 2022
Accepted: 20 November 2022
Published: 28 November 2022
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Abstract: Handling missing values (MVs) and feature selection (FS) are vital preprocessing tasks for many pattern recognition, data mining, and machine learning (ML) applications involving classification and regression problems. The presence of MVs in data adversely affects decision making. Hence, MVs must be treated as a critical problem during preprocessing.
To this end, the authors propose a new algorithm for handling MVs using FS. Bayesian ridge regression (BRR) is the most beneficial type of Bayesian regression. BRR estimates a probabilistic model of the regression problem. The proposed algorithm is dubbed cumulative Bayesian ridge with similarity and Luca’s fuzzy entropy measure (CBRSL). CBRSL shows how fuzzy entropy FS, used to select the candidate feature holding MVs, aids the prediction of the MVs within that feature using the Bayesian ridge technique. CBRSL can then handle MVs within the other features in cumulative order: the filled features are incorporated into the BRR equation to predict the MVs of the next selected incomplete feature. An experimental analysis was conducted on four datasets holding MVs generated from three missingness mechanisms to compare CBRSL with state-of-the-art practical imputation methods. The performance was measured in terms of R² score (coefficient of determination), RMSE (root mean square error), and MAE (mean absolute error). Experimental results indicate that the accuracy and execution times differ depending on the amount of MVs, the dataset’s size, and the missingness mechanism. In addition, the results show that CBRSL can handle MVs generated from any missingness mechanism with accuracy competitive with the compared methods.

Keywords: missingness mechanisms; feature selection; Bayesian ridge regression; imputation; similarity classifier

1. Introduction

Data refers to cases or instances from the domain that characterize the problem to be solved. In data management, one of the most important concerns is the quality of the data. Incomplete data often leads to bad decisions and poor analytics. Researchers and analysts may face barriers when dealing with incomplete data.
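The abstract names three measures for judging imputation quality: R² score, RMSE, and MAE. As a minimal sketch of their standard definitions, the following compares imputed values against the true values that were hidden. The function names and the small example vectors are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between true and imputed values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error between true and imputed values."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical example: true values that were masked, and their imputations.
true_values = [3.0, 5.0, 7.0, 9.0]
imputed_values = [2.0, 5.0, 8.0, 9.0]

print("MAE :", mae(true_values, imputed_values))
print("RMSE:", rmse(true_values, imputed_values))
print("R^2 :", r2_score(true_values, imputed_values))
```

Lower MAE and RMSE and a higher R² indicate imputations closer to the hidden true values. RMSE penalizes large individual errors more heavily than MAE does.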
In addition, knowledge discovery becomes difficult to conduct with incomplete data, which means that data quality comes first and foremost before working with the data [1]. The most popular form of data is so-called tabular or structured data (i.e., rows of instances and columns of features). The acquisition and collection of data may introduce errors, for example, replicated entries, outliers, mixed formats, typos, and MVs. Error detection (i.e., errors are identified by experts) and error repair (i.e., bringing the data to