Enhancing Classification Accuracy of K-Nearest Neighbours Algorithm Using Gain Ratio
Aditya Duneja¹, Thendral Puyalnithi²
¹School of Computer Science and Engineering, VIT University, Vellore, 632014, Tamil Nadu, India
²Professor, School of Computer Science and Engineering, VIT University, Vellore, 632014, Tamil Nadu, India
Abstract – K-Nearest Neighbors, or KNN, is among the simplest machine learning techniques used for classification. The approach finds the k nearest neighbors of a test vector and assigns it a class by majority voting. One of the assumptions in KNN is that the data lie in a feature space, that is, measures like Euclidean distance and Manhattan distance can be applied to them. It has applications in computer vision, recommender systems and various other fields. In this paper, a modification of this method is put forward which aims at improving its accuracy. The modified version combines the distance metrics applied in K-nearest neighbors with the entropy, or strength, of each attribute. The way the distance between data points is calculated is altered so that the strength of each individual attribute or dimension is taken into consideration. The proposed technique was tested on various UCI datasets and the accuracies of both methods, the original and the modified, are compared.
Key Words: classification, entropy, Euclidean distance,
KNN, dimension
1. Introduction
In machine learning, classification is a method which uses a training set comprising various instances and their respective classes to determine the category of a new observation or a set of new observations. It is a supervised learning technique. Classification has a wide range of applications in fields such as medical imaging, speech recognition, handwriting recognition and search engines, so high-accuracy classification models are required to make correct predictions in these fields. Common classification models include decision trees, Support Vector Machines and the K-nearest neighbor classifier.
The K-nearest neighbor or KNN classifier is one of the most frequently used classification techniques in machine learning. The algorithm involves computing the distance between the test observation and every observation in the training set. The distance computed is usually the Euclidean distance, although other distance measures can also be applied. The k nearest observations are then chosen, and the class held by the majority of these k observations is assigned to the test observation. KNN makes no assumptions regarding the distribution of the data and is therefore non-parametric. No separate training phase is defined for this algorithm, as the whole dataset is used for the distance computation every time.
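For concreteness, the procedure just described can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation; the function name, the default k = 3 and the use of Euclidean distance are assumptions made for the example.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # Distance from the test observation to every training observation
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Indices of the k nearest training observations
    nearest = np.argsort(distances)[:k]
    # Assign the class held by the majority of those k neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

As the sketch makes plain, all the work happens at prediction time, which is why no separate training phase exists.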
KNN has undergone many modifications over the years in
order to improve its accuracy. Faruk et al. [1] proposed a
new version of KNN where the neighbors are chosen with
respect to the angle between them. Even if two neighbors
have the same distance from the test vector, they are
considered different with respect to the angle they make.
Singh et al. [2] devised a way to accelerate the calibration
process of KNN using parallel computing. Parvin et al. [3]
modified the KNN algorithm by weighting the neighbors of
the unknown observation. The contribution of every
neighbor is considered to be inversely proportional to its
distance from the test vector, as sketched below.
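A minimal sketch of such inverse-distance weighted voting follows; the exact weighting function used in [3] may differ, so the 1/d form here is an illustrative assumption.

import numpy as np

def weighted_knn_predict(X_train, y_train, x_test, k=3, eps=1e-9):
    # Distances from the test vector to all training observations
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Each neighbor votes with weight inversely proportional to its distance
    scores = {}
    for i in nearest:
        weight = 1.0 / (distances[i] + eps)  # eps guards against zero distance
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + weight
    return max(scores, key=scores.get)

Closer neighbors thus dominate the vote, while distant neighbors contribute little.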
Fuzzy K nearest neighbors was first proposed in 1985 by Keller et al. [4]. Fuzzy
memberships are assigned to the training data in order to
obtain more accurate results as every neighbor is given a
different weight. A new modification to the K-nearest
neighbors algorithm is proposed in this paper taking the
strength of the attributes into account; a preview of the idea is sketched below.
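The idea can be previewed as a weighted Euclidean distance. This is only an illustrative reading of the method described so far: it assumes the strength of each attribute is expressed as a precomputed gain-ratio weight, and the gain-ratio values themselves are taken as given here.

import numpy as np

def gain_ratio_distance(x, y, gain_ratio):
    # gain_ratio[j] is the (assumed precomputed) strength of attribute j;
    # stronger attributes contribute more to the overall distance
    return np.sqrt((gain_ratio * (x - y) ** 2).sum())

Substituting this for the plain Euclidean distance in the earlier KNN sketch yields the attribute-weighted variant.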
Faziludeen et al. [5] proposed a new modification to KNN known as the
Evidential K nearest neighbors (EKNN). The classes of all k nearest neighbors of a test sample, not only the majority class, were used to determine the predicted class. Distance
between samples was considered to measure the weight
of the evidence attributed to each class. Some research has also been done on reducing the size of the dataset required to obtain effective results from the KNN algorithm. Song et al. [6]
proposed an algorithm which removed the outlier instances from the dataset and sorted the remaining instances on the basis of the difference in output between each instance and its neighbors. Another modification to KNN was put
forward by Nguyen et al. [8] in the form of a distance metric learning method, with the resulting problem formulated as a quadratic program. Lin et al. [9] proposed
a nearest neighbor classifier by combining neighborhood
information. While finding the distance between samples,
the influence of their neighbors is taken into account.
Manocha et al. [10] proposed a probabilistic nearest
neighbor classifier which used probabilistic methods to
assign class membership, a capability which standard KNN lacks. Sarkar [11] improved the accuracy of KNN by