Effects of Distance Measure Choice on KNN Classifier Performance - A Review

V. B. Surya Prasath a,b,c,d,*, Haneen Arafat Abu Alfeilat e, Ahmad B. A. Hassanat e, Omar Lasassmeh e, Ahmad S. Tarawneh f, Mahmoud Bashir Alhasanat g,h, Hamzeh S. Eyal Salman e

a Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229 USA
b Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH USA
c Department of Biomedical Informatics, College of Medicine, University of Cincinnati, Cincinnati, OH 45267 USA
d Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221 USA
e Department of Information Technology, Mutah University, Karak, Jordan
f Department of Algorithms and Their Applications, Eötvös Loránd University, Budapest, Hungary
g Department of Geomatics, Faculty of Environmental Design, King Abdulaziz University, Jeddah, Saudi Arabia
h Faculty of Engineering, Al-Hussein Bin Talal University, Maan, Jordan

* Corresponding author. Tel.: +1 513 636 2755. Email address: prasatsa@uc.edu (V. B. Surya Prasath)

Abstract

The K-nearest neighbor (KNN) classifier is one of the simplest and most common classifiers, yet its performance competes with the most complex classifiers in the literature. The core of this classifier depends mainly on measuring the distance or similarity between the tested examples and the training examples. This raises a major question: among the large number of available distance and similarity measures, which should be used with the KNN classifier? This review attempts to answer this question by evaluating the performance (measured by accuracy, precision, and recall) of KNN with a large number of distance measures, tested on a number of real-world datasets, with and without adding different levels of noise. The experimental results show that the performance of the KNN classifier depends significantly on the distance used, with large gaps between the performances of different distances. We found that a recently proposed non-convex distance performed best on most datasets compared to the other tested distances. In addition, the performance of KNN with this top-performing distance degraded by only about 20% when the noise level reached 90%, and this held for most of the other distances as well. This means that the KNN classifier using any of the top 10 distances tolerates noise to a certain degree. Moreover, the results show that some distances are less affected by the added noise than others.

Keywords: K-nearest neighbor, big data, machine learning, noise, supervised learning

1. Introduction

Classification is an important problem in big data, data science, and machine learning. The K-nearest neighbor (KNN) algorithm is one of the oldest, simplest, and most accurate algorithms for pattern classification and regression models. KNN was proposed in 1951 by [20] and later modified by [15]. KNN has been identified as one of the top ten methods in data mining [82]. Consequently, KNN has been studied over the past few decades and widely applied in many fields [8]. Thus, KNN serves as a baseline classifier in many pattern classification problems such as pattern recognition [84], text categorization [54], ranking models [83], object recognition [6], and event recognition [85]. KNN is a non-parametric algorithm [45]. Non-parametric does not mean there are no parameters; rather, it means the number of parameters is not fixed irrespective of the size of the data, but is instead determined by the size of the training dataset.
Moreover, no assumptions need to be made about the underlying data distribution. Thus, KNN can be the best choice for any classification study that involves little or no prior knowledge about the distribution of the data. In addition, KNN is a lazy learning method: it stores all the training data and waits until test data is presented, without having to build a learning model [76].
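To make the role of the distance measure and the lazy-learning property concrete, the following is a minimal Python sketch of a KNN classifier with a pluggable distance function. The euclidean() helper, the k=3 default, and the toy data are illustrative assumptions only, not the specific measures or datasets evaluated in this review; fit() merely stores the training examples, and all distance computation is deferred to prediction time.

```python
import math
from collections import Counter

def euclidean(x, y):
    # One candidate distance measure (an assumption for illustration);
    # the review compares many alternative distance/similarity measures.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

class KNNClassifier:
    """Lazy learner: fit() only stores the data; all work happens in predict()."""

    def __init__(self, k=3, distance=euclidean):
        self.k = k
        self.distance = distance  # the choice studied in this review

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)  # no explicit model is built
        return self

    def predict(self, x):
        # Rank all stored training examples by distance to the query point,
        # then take a majority vote among the labels of the k nearest.
        nearest = sorted(zip(self.X, self.y),
                         key=lambda pair: self.distance(x, pair[0]))[:self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

# Usage with toy 2-D data (illustrative values only):
clf = KNNClassifier(k=3).fit([(0, 0), (0, 1), (5, 5), (6, 5)], ["a", "a", "b", "b"])
print(clf.predict((5, 6)))  # -> "b"
```

Swapping in a different distance function changes which neighbors are deemed "nearest", which is precisely why the choice of measure can shift the classifier's accuracy, precision, and recall.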