A New Classification Algorithm Using Mutual Nearest Neighbors
Huawen Liu¹, Shichao Zhang¹, Jianming Zhao¹, Xiangfu Zhao¹, Yuchang Mo¹,²
¹Department of Computer Science, Zhejiang Normal University, Jinhua 321000, China
²School of Computer Science and Engineering, Southeast University, Nanjing 211189, China
{hwliu, zhangsc, zjm}@zjnu.cn
Abstract—kNN is a simple but effective and powerful lazy learning algorithm. It is now widely used in practice and plays an important role in pattern classification. However, how to choose an optimal value of k, which directly affects the performance of kNN, is still a challenge. To alleviate this problem, in this paper we propose a new learning algorithm under the framework of kNN. The primary characteristic of our method is that it adopts mutual nearest neighbors, rather than the k nearest neighbors, to determine the class labels of unknown instances. The advantage of mutual neighbors is that pseudo nearest neighbors can be identified and excluded from the prediction process. As a result, the final prediction is more reasonable. Experimental results on UCI datasets show that the classification performance achieved by our proposed method is better than that of the traditional kNN.
Keywords—data mining, pattern classification, kNN, mutual
nearest neighbor, data reduction
I. INTRODUCTION
Pattern classification is a fundamental problem in artificial intelligence, decision-making, machine learning, knowledge discovery and other fields [1]. It can be described generally as follows: given N training instances (or samples) with known class labels from C = {c1, ..., cm}, how can the class label of an unknown instance be predicted? That is to say, the purpose of pattern classification is mainly to predict, or mark, unknown instances with predefined labels in the light of historical behavior.
Due to its significant role in exploratory pattern analysis, pattern classification has been extensively studied, and many outstanding classification algorithms of different kinds, such as Naïve Bayes (NBC) and C4.5, have been developed. Among them, the k-nearest neighbor learning algorithm (kNN) is one of the most straightforward approaches to classifying instances represented as points in some feature space.
kNN is a kind of lazy learning algorithm. Given C predefined class labels and N labeled training instances {(xi, ci)}, kNN first stores all of these training instances in memory. To predict the class label of a new instance x0, it finds the k nearest neighbors of x0 and determines the class label of x0 by a majority vote among them. Without prior knowledge, the Euclidean distance is often used to measure the distances between instances. Despite its simplicity, this off-the-shelf method can still yield competitive results even compared with the most sophisticated machine learning methods, as simulation experiments on a number of pattern recognition tasks have demonstrated [2].
As a matter of fact, the asymptotic error rate of the kNN method is no greater than twice the Bayes error rate, independent of the distance metric applied. Moreover, as the size N approaches infinity and k grows appropriately (with k/N approaching zero), the error rate converges asymptotically to the optimal Bayes error rate [3][4]. In addition, kNN is particularly effective when the probability distributions of the feature variables are not known. Owing to these advantages, kNN is now widely applied in many fields, including object recognition, classification and clustering, image processing, texture synthesis, information retrieval, intrusion detection and others [2].
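To make this procedure concrete, the following minimal NumPy sketch (illustrative only, not the authors' implementation) stores all training instances and classifies a query x0 by a majority vote among its k nearest neighbors under Euclidean distance.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, k=3):
    # Euclidean distances from the query x0 to every stored training instance
    dists = np.linalg.norm(X_train - x0, axis=1)
    # Indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage with two 2-D classes (hypothetical data)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9])))  # -> 1
```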
Despite its ease of implementation and relatively high accuracy, kNN also has several methodological shortcomings that lead to poor performance if they are not handled well. Generally, the performance of a kNN classifier depends primarily on three aspects, i.e., the number k of nearest neighbors, the size N of the training set and the choice of distance metric [5], among which the selection of k is the most crucial. Usually, a larger value of k is more robust to noise, but incurs a higher computational cost. However, Domeniconi et al. [6] pointed out that predetermining the value of k is very difficult, especially when the points are not uniformly distributed. Therefore, different applications need different optimal values of k, and how to choose the optimal value of k for most kNN extensions is still a challenge.
To cope with this problem, many endeavors have been made over the past years to further improve the performance of kNN classifiers. For example, one can resort to the cross-validation (CV) technique to select a proper value of k (as sketched below), or employ special data structures, e.g., the k-d tree or R-tree, to find approximate nearest neighbors [7]. However, the CV method incurs substantial extra computational cost, while such data structures are ill-suited to dynamic situations. Additionally, Zhang [8] used shell neighbors in place of the k nearest neighbors. From the viewpoint of overhead, it is desirable to enhance the performance of traditional kNN without imposing many constraints on this simple method.
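As an illustration of the CV-based selection of k, the following sketch scores each candidate k by 5-fold cross-validated accuracy and keeps the best one (it assumes scikit-learn and the Iris benchmark as a stand-in dataset; neither is prescribed by the paper).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset, not from the paper

best_k, best_score = 1, 0.0
for k in range(1, 26):
    # Mean 5-fold cross-validated accuracy for this candidate k
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(f"selected k = {best_k} (CV accuracy = {best_score:.3f})")
```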
In this paper, a new classification algorithm, called MNN, is proposed. Unlike traditional kNN methods, which employ the k nearest neighbors to predict the class label of a new instance x0, MNN makes use of the mutual nearest neighbors of x0 to determine its label.
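The formal definition used by MNN is developed later in the paper; the sketch below illustrates one common formulation of mutual nearest neighbors, which may differ in detail from the authors': a training instance is kept as a neighbor of x0 only if x0 would in turn rank among its own k nearest points, so that one-sided ("pseudo") neighbors are discarded.

```python
import numpy as np

def mutual_nearest_neighbors(X_train, x0, k=3):
    # k nearest training instances of the query x0
    dists_to_x0 = np.linalg.norm(X_train - x0, axis=1)
    knn_of_x0 = np.argsort(dists_to_x0)[:k]

    mutual = []
    for i in knn_of_x0:
        # Distances from candidate neighbor x_i to the other training instances
        d = np.linalg.norm(X_train - X_train[i], axis=1)
        d[i] = np.inf  # exclude x_i itself
        # Keep x_i only if x0 would also rank among x_i's k nearest points
        if dists_to_x0[i] <= np.sort(d)[k - 1]:
            mutual.append(i)
    return mutual  # indices of mutual neighbors; the rest are "pseudo" neighbors
```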
To estimate the label of x0, MNN first