Journal of Intelligent & Fuzzy Systems 38 (2020) 2661–2672
DOI:10.3233/JIFS-179552
IOS Press

A new text-based w-distance metric to find the perfect match between words

Munwar Ali a,∗, Low Tang Jung b, Osama Hosam c,d, Asif Ali Wagan e, Rehan Ali Shah f and Mashael Khayyat g

a Department of IT, Shaheed Benazir Bhutto University, Shaheed Benazirabad, Sindh, Pakistan
b Department of Computer and Information Sciences, Universiti Teknologi PETRONAS, Malaysia
c The College of Computer Science and Engineering in Yanbu, Taibah University, Medina, Saudi Arabia
d Informatics Research Institute, The City for Scientific Research and Technology Applications, Alexandria, Egypt
e Department of Computer Science, SMIU, Karachi, Pakistan
f Department of Computer Systems Engineering, Faculty of Engineering, The Islamia University Bahawalpur, Pakistan
g Department of Information Systems and Technology, Faculty of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia

Abstract. The k-NN algorithm is an instance-based learning algorithm that is widely used in data mining applications. The core of the k-NN algorithm is its distance/similarity function, and the performance of the algorithm varies with the choice of that function. The traditional distance/similarity functions in k-NN do not handle mix-mode words well, that is, cases where one string contains multiple substrings/words. For example, a string may be the two-word “Employee Name”, the one-word “Name”, or the multi-word “Name of Employee”. This ambiguity makes it difficult for existing distance/similarity functions to find the perfect match between words. To improve the perfect-match calculation in the traditional k-NN algorithm, a new similarity distance metric, named word-distance (w-distance), is developed. The perfect match helps to identify the exact required value.
The proposed w-distance is a hybrid of distance and similarity because it handles the dissimilarity and similarity features of strings at the same time. The simulation results showed that w-distance yields better k-NN performance than the Euclidean distance and the cosine similarity.

Keywords: k-NN algorithm, distance/similarity metric, text match, data mining, cosine similarity

∗ Corresponding author. Munwar Ali, Department of IT, Shaheed Benazir Bhutto University, Shaheed Benazirabad, Sindh, Pakistan. E-mail: mazardari@gmail.com.

ISSN 1064-1246/20/$35.00 © 2020 – IOS Press and the authors. All rights reserved

1. Introduction

In 1951, Evelyn Fix and J.L. Hodges wrote an unpublished technical report at the US Air Force School of Aviation Medicine, in which they proposed a non-parametric method for the classification of patterns. This non-parametric method was later named the k-NN algorithm [1]. The k-NN algorithm is one of the simplest instance-based methods, and it has lower computational complexity in the training phase than other algorithms such as decision trees, neural networks and Bayes nets [2]. The k-NN classifier is built on the nearest-neighbor principle, in which the closeness of a test instance to the training instances is determined. The label is then decided by majority vote: the label held by the majority of the closest instances is assigned to the test instance. The