A New Classification Algorithm using Mutual Nearest Neighbors

Huawen Liu¹, Shichao Zhang¹, Jianming Zhao¹, Xiangfu Zhao¹, Yuchang Mo¹,²
¹ Department of Computer Science, Zhejiang Normal University, Jinhua 321000, China
² School of Computer Science and Engineering, Southeast University, Nanjing 211189, China
{hwliu, zhangsc, zjm}@zjnu.cn

Abstract—kNN is a simple but effective and powerful lazy learning algorithm. It has now been widely used in practice and plays an important role in pattern classification. However, choosing an optimal value of k, which directly affects the performance of kNN, remains a challenge. To alleviate this problem, in this paper we propose a new learning algorithm under the framework of kNN. The primary characteristic of our method is that it adopts mutual nearest neighbors, rather than k nearest neighbors, to determine the class labels of unknown instances. The advantage of mutual neighbors is that pseudo nearest neighbors can be identified and excluded from the prediction process, so the final decision is more reasonable. Experimental results on UCI datasets show that the classification performance achieved by the proposed method is better than that of the traditional one.

Keywords—data mining, pattern classification, kNN, mutual nearest neighbor, data reduction

I. INTRODUCTION

Pattern classification is a fundamental problem in artificial intelligence, decision making, machine learning, knowledge discovery and other fields [1]. It can be described generally as follows: given N training instances (or samples) with known class labels drawn from C = {c_1, ..., c_m}, how do we predict the class label of an unknown instance? That is to say, the purpose of pattern classification is mainly to predict, or mark, unknown instances with predefined labels in light of historical behavior. Due to its significant role in exploratory pattern analysis, pattern classification has been extensively studied, and many outstanding classification algorithms of different kinds, such as Naïve Bayes (NBC) and C4.5, have been developed. Among them, the k-nearest neighbor learning algorithm (kNN) is one of the most straightforward approaches to classifying instances that are represented as points in some feature space.

kNN is a lazy learning algorithm. Given C predefined class labels and N labeled training instances {x_i, c_i}, kNN first stores all of the training instances in memory. To predict the class label of a new instance x_0, it finds the k nearest neighbors of x_0 and determines the class label of x_0 by a majority vote among them. Without prior knowledge, the Euclidean distance is often used to measure the distance between instances. Despite its simplicity, this off-the-shelf method can yield competitive results even compared with the most sophisticated machine learning methods. Many simulation experiments on a number of pattern recognition tasks have demonstrated its competitive performance [2]. In fact, the error rate of the kNN method is no greater than twice the Bayes error rate, independent of the distance metric applied. Moreover, as the size N approaches infinity and k increases, the error rate converges asymptotically to the optimal Bayes error rate [3][4]. In addition, kNN is particularly effective when the probability distributions of the feature variables are not known.
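To make the prediction step just described concrete, the following is a minimal sketch of plain kNN with Euclidean distance and majority voting; it is an illustration rather than the authors' implementation, and the function and variable names are our own.

```python
# A minimal sketch of the classical kNN rule described above:
# store the training set, find the k nearest neighbors of a query
# point under Euclidean distance, and take a majority vote.
# Names are illustrative, not taken from the paper.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, k=5):
    """Predict the label of x0 from its k nearest training instances."""
    # Euclidean distances from x0 to every training instance.
    dists = np.linalg.norm(X_train - x0, axis=1)
    # Indices of the k closest training instances.
    nn_idx = np.argsort(dists)[:k]
    # Majority vote among the neighbors' class labels.
    votes = Counter(y_train[i] for i in nn_idx)
    return votes.most_common(1)[0][0]

# Toy usage with a small 2-D training set.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))  # -> 1
```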
Owing to these advantages, kNN has now been widely applied in many fields, including object recognition, classification and clustering, image processing, texture synthesis, information retrieval, intrusion detection and others [2].

Despite being easy to implement and relatively accurate, kNN also has several methodological shortcomings that can lead to poor performance if they are not handled well. Generally, the performance of a kNN classifier depends primarily on three aspects: the number k of nearest neighbors, the size N of the training set and the choice of distance metric [5], of which the selection of k is the most crucial. Usually, a larger value of k is more robust to noise, but it requires more computational cost. However, Domeniconi et al. [6] pointed out that predetermining the value of k is very difficult, especially when the points are not uniformly distributed. Therefore, different applications need different optimal values of k, and how to choose the optimal value of k remains a challenge for most kNN extensions.

To cope with this problem, many endeavors have been made over the past years to further improve the performance of kNN classifiers. For example, one can resort to cross-validation (CV) to select a proper value of k, or employ special data structures, e.g. the k-d tree or R-tree, to find approximate nearest neighbors [7]. However, the CV method requires much more computational cost, while such data structures are ill-suited to dynamic situations. Additionally, Zhang [8] used shell neighbors in place of the k nearest neighbors. From the viewpoint of overhead, it is desirable to enhance the performance of traditional kNN without imposing many constraints on this simple method.

In this paper, a new classification algorithm, called MNN, is proposed. Unlike traditional kNN methods, which employ the k nearest neighbors to predict the class label of a new instance x_0, MNN makes use of the mutual nearest neighbors of x_0 to determine its label. To estimate the label of x_0, MNN first
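The full MNN procedure is specified later in the paper. As a rough sketch of the mutual-nearest-neighbor notion it builds on, here taken as "x_0 and a training instance are mutual neighbors when each lies among the other's k nearest points", which may differ from the authors' exact definition, one way to filter the neighbors of a query point might look as follows; all names are hypothetical.

```python
# A rough sketch of the mutual-nearest-neighbor idea underlying MNN.
# Assumption (not the paper's exact definition): a training instance x_i
# is kept as a mutual nearest neighbor of the query x0 only if x0 would
# also rank among the k nearest points of x_i.
import numpy as np

def mutual_nearest_neighbors(X_train, x0, k=5):
    """Return indices of training instances that are mutual k-NNs of x0."""
    dists_to_x0 = np.linalg.norm(X_train - x0, axis=1)
    candidates = np.argsort(dists_to_x0)[:k]   # the k nearest neighbors of x0
    mutual = []
    for i in candidates:
        # Distances from candidate x_i to all other training points.
        d_train = np.linalg.norm(X_train - X_train[i], axis=1)
        d_train[i] = np.inf                    # exclude x_i itself
        # Keep x_i only if x0 is closer to it than its k-th nearest training point.
        kth = np.sort(d_train)[k - 1]
        if dists_to_x0[i] < kth:
            mutual.append(i)
    return mutual
```

Intuitively, this reciprocal check is what allows pseudo nearest neighbors, points that are close to x_0 only in one direction, to be identified and excluded before voting, as described in the abstract.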