Effects of Distance Measure Choice on KNN Classifier Performance - A Review

V. B. Surya Prasath a,b,c,d,*, Haneen Arafat Abu Alfeilat e, Ahmad B. A. Hassanat e, Omar Lasassmeh e, Ahmad S. Tarawneh f, Mahmoud Bashir Alhasanat g,h, Hamzeh S. Eyal Salman e

a Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229 USA
b Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH USA
c Department of Biomedical Informatics, College of Medicine, University of Cincinnati, Cincinnati, OH 45267 USA
d Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221 USA
e Department of Information Technology, Mutah University, Karak, Jordan
f Department of Algorithms and Their Applications, Eötvös Loránd University, Budapest, Hungary
g Department of Geomatics, Faculty of Environmental Design, King Abdulaziz University, Jeddah, Saudi Arabia
h Faculty of Engineering, Al-Hussein Bin Talal University, Maan, Jordan

* Corresponding author. Tel.: +1 513 636 2755. Email address: prasatsa@uc.edu (V. B. Surya Prasath)

Abstract

The K-nearest neighbor (KNN) classifier is one of the simplest and most common classifiers, yet its performance competes with the most complex classifiers in the literature. The core of this classifier depends mainly on measuring the distance or similarity between the tested examples and the training examples. This raises a major question: among the large number of available distance and similarity measures, which should be used with the KNN classifier? This review attempts to answer this question by evaluating the performance (measured by accuracy, precision, and recall) of KNN with a large number of distance measures, tested on a number of real-world datasets, with and without adding different levels of noise. The experimental results show that the performance of the KNN classifier depends significantly on the distance used, with large gaps between the performances of different distances. We found that a recently proposed non-convex distance performed best on most datasets compared to the other tested distances. In addition, the performance of KNN with this top-performing distance degraded by only about 20% when the noise level reached 90%, and this held for most of the other distances as well. This means that the KNN classifier using any of the top 10 distances tolerates noise to a certain degree. Moreover, the results show that some distances are less affected by the added noise than others.

Keywords: K-nearest neighbor, big data, machine learning, noise, supervised learning

1. Introduction

Classification is an important problem in big data, data science, and machine learning. The K-nearest neighbor (KNN) algorithm is one of the oldest, simplest, and most accurate algorithms for pattern classification and regression models. KNN was proposed in 1951 by [20] and later modified by [15]. KNN has been identified as one of the top ten methods in data mining [82]. Consequently, KNN has been studied over the past few decades and widely applied in many fields [8]. Thus, KNN serves as a baseline classifier in many pattern classification problems such as pattern recognition [84], text categorization [54], ranking models [83], object recognition [6], and event recognition [85]. KNN is a non-parametric algorithm [45]. Non-parametric does not mean there are no parameters; rather, it means the number of parameters is not fixed irrespective of the size of the data, but is instead determined by the size of the training dataset.
Moreover, no assumptions need to be made about the underlying data distribution. Thus, KNN can be the best choice for any classification study that involves little or no prior knowledge about the distribution of the data. In addition, KNN is a lazy learning method: it stores all the training data and waits until test data is presented, without having to build a learning model [76].
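To make the role of the distance measure and the lazy-learning property concrete, the following is a minimal Python sketch of a KNN classifier with a pluggable distance function. The euclidean() helper, the k=3 default, and the toy data are illustrative assumptions only, not the specific measures or datasets evaluated in this review; fit() merely stores the training examples, and all distance computation is deferred to prediction time.

```python
import math
from collections import Counter

def euclidean(x, y):
    # One candidate distance measure (an assumption for illustration);
    # the review compares many alternative distance/similarity measures.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

class KNNClassifier:
    """Lazy learner: fit() only stores the data; all work happens in predict()."""

    def __init__(self, k=3, distance=euclidean):
        self.k = k
        self.distance = distance  # the choice studied in this review

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)  # no explicit model is built
        return self

    def predict(self, x):
        # Rank all stored training examples by distance to the query point,
        # then take a majority vote among the labels of the k nearest.
        nearest = sorted(zip(self.X, self.y),
                         key=lambda pair: self.distance(x, pair[0]))[:self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

# Usage with toy 2-D data (illustrative values only):
clf = KNNClassifier(k=3).fit([(0, 0), (0, 1), (5, 5), (6, 5)], ["a", "a", "b", "b"])
print(clf.predict((5, 6)))  # -> "b"
```

Swapping in a different distance function changes which neighbors are deemed "nearest", which is precisely why the choice of measure can shift the classifier's accuracy, precision, and recall.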