International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 07 | July -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 1807
Parallel kNN for Big Data using Adaptive Indexing
Tejal Katore¹, Prof. Dr. Suhasini Itkar²
¹Post Graduate Scholar, Dept. of Computer Engineering, P.E.S Modern College of Engineering, Pune, India
²Professor, Dept. of Computer Engineering, P.E.S Modern College of Engineering, Pune, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - k Nearest Neighbor (kNN) is frequently used in
classification methods; the kNN algorithm determines the class
membership of a given element. kNN does not perform well in the
context of large data, so multiple techniques were introduced to
execute kNN in parallel and enhance its performance. Along with
this, the MapReduce programming model was used, which is well
suited to distributed approaches. The reference algorithms that
compute kNN on MapReduce are H-zkNNJ, H-BNLJ, and RankReduce.
Data preprocessing, data partitioning, and computation are the
three common steps of kNN computation; the given solutions differ
only in their partitioning technique. Adaptive indexing is an
indexing paradigm in which index creation and reorganization take
place automatically and incrementally. It is used along with the
RankReduce algorithm and helps kNN execute more efficiently.
Key Words: Hadoop Block Nested Loop kNN (H-BNLJ),
Hadoop z value (H-zkNNJ), k Nearest Neighbor,
MapReduce, Performance Evaluation, RankReduce.
1. INTRODUCTION
k Nearest Neighbor is widely used as a classification or
clustering method in machine learning and data mining [1].
The k-Nearest Neighbor algorithm (kNN) [2] is considered
one of the ten most significant data mining algorithms. It is
a lazy learner that does not need an explicit training phase.
The method requires that all data instances be stored; unseen
cases are classified by finding the class labels of the k
instances closest to them [3]. To determine how close two
instances are, several distance measures can be computed. This
operation has to be performed for all the input examples
against the whole training dataset.
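The classification procedure described above can be sketched as follows; this is a minimal illustration of the general kNN method with Euclidean distance, using our own function names and toy data, not an implementation from any of the cited works:

```python
from collections import Counter
import math

def knn_classify(query, training_set, k=3):
    """Classify `query` by majority vote among its k nearest
    training instances, using Euclidean distance."""
    # Compute the distance from the query to every training instance.
    distances = []
    for features, label in training_set:
        distances.append((math.dist(query, features), label))
    # Keep the k closest instances and take a majority vote on labels.
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy data: (features, class label)
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(knn_classify((1.1, 0.9), training, k=3))  # -> "A"
```

Note that every query scans the full training set, which is exactly the cost that motivates the indexing and parallelization techniques discussed below.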
Given a set R of query points and a set S of reference points, a k
nearest neighbor join is an operation which, for each point in
R, discovers the k nearest neighbors in S. The data points are
divided into a training set and a testing set, also called unlabeled
data. The aim is to find the class label for the new points. For
each unlabeled point, a kNN query on the training set is
performed to estimate its class membership. This process
can be considered a kNN join of the testing set with the
training set. The basic idea to compute a kNN join is to
perform a pairwise distance computation for each element
in R and each element in S. The difficulties mainly lie in the
following two aspects: (1) data volume and (2) data
dimensionality. A lot of work has been dedicated to reducing
the in-memory computational complexity [1]. These works
mainly focus on two points: (1) using indexes to decrease the
number of distances that need to be calculated; such indexes can
hardly be scaled to high-dimensional data; (2) using projections
to reduce the dimensionality of the data, where maintaining
accuracy becomes another issue. Despite these efforts,
there are still significant limitations to processing kNN on a
centralized machine when the amount of data increases
[4],[10],[11].
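The pairwise idea behind a kNN join can be sketched as a brute-force computation; the function name and toy data below are our own illustration of the general operation, not code from the cited solutions:

```python
import math
import heapq

def knn_join(R, S, k):
    """Brute-force kNN join: for each query point in R, return the
    k nearest reference points in S via pairwise distances."""
    result = {}
    for r in R:
        # Keep the k reference points with the smallest distance to r.
        result[r] = heapq.nsmallest(k, S, key=lambda s: math.dist(r, s))
    return result

R = [(0.0, 0.0), (4.0, 4.0)]
S = [(0.0, 1.0), (1.0, 1.0), (3.0, 3.0), (5.0, 5.0)]
print(knn_join(R, S, k=2))
```

The cost is O(|R| x |S|) distance computations, which is what makes both the data volume and the dimensionality problematic at scale.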
Only distributed and parallel solutions have proven powerful
for large datasets. MapReduce is a flexible and scalable
parallel and distributed programming paradigm specially
designed for data-intensive processing, aiming at efficiently
processing large datasets. Writing a MapReduce job consists
of: (1) representing the data as key-value pairs, (2) defining a
map function, and (3) defining a reduce function. Here we
introduce the reference algorithms that compute kNN over
MapReduce. These algorithms are based on different methods,
but follow a common workflow which consists of three ordered
steps: (1) data pre-processing, (2) data partitioning, and (3) kNN
computation.
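The three parts of a MapReduce job can be illustrated with a small in-process simulation of kNN computation; the function names, the trivial single-partition strategy, and the toy data are our own sketch, not taken from Hadoop or from any of the reference algorithms:

```python
from collections import defaultdict
import math
import heapq

def map_phase(records):
    """Map: emit each record as a key-value pair, keyed by partition.
    A real partitioning strategy would spread keys; here we use one."""
    for kind, point in records:      # kind is "R" (query) or "S" (reference)
        yield 0, (kind, point)

def shuffle(pairs):
    """Group map outputs by key, as the MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values, k):
    """Reduce: compute the kNN of each query point within its partition."""
    queries = [p for kind, p in values if kind == "R"]
    refs = [p for kind, p in values if kind == "S"]
    for q in queries:
        yield q, heapq.nsmallest(k, refs, key=lambda s: math.dist(q, s))

records = [("R", (0.0, 0.0)), ("S", (0.0, 1.0)),
           ("S", (2.0, 2.0)), ("S", (5.0, 5.0))]
for key, values in shuffle(map_phase(records)).items():
    for q, neighbours in reduce_phase(key, values, k=2):
        print(q, "->", neighbours)
```

The reference algorithms differ precisely in how the map phase assigns partition keys, since that determines how much work each reducer must do.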
2. LITERATURE REVIEW
kNN is based on a distance function that measures the
difference or similarity between two instances. kNN with a
centralized approach was not able to cope with large
inputs, so new approaches to execute it in parallel were
developed. The various existing solutions that perform
the kNN operation in the context of MapReduce are given below.
The approach H-BNLJ [1] consists of two phases. The data set
is divided into blocks of a particular size. The data is
partitioned such that an element in a partition of R has
its nearest neighbors in only one partition of S. Two
partitioning strategies are proposed that separate the datasets
into independent partitions while preserving locality
information. H-zkNNJ [1],[4], which uses size-based
partitioning strategies, has very good load balance,
with a very small deviation in the completion time of each
task. In H-zkNNJ, the z-value transformation leads to
information loss. The recall of this algorithm is influenced by
the nature, the dimension, and the size of the input data.
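The z-value transformation maps a multi-dimensional point to a one-dimensional value by interleaving the bits of its coordinates (a Morton code); nearby points usually, but not always, receive nearby z-values, which is the source of the information loss. A minimal sketch of the interleaving, as our own illustration rather than the H-zkNNJ implementation:

```python
def z_value(coords, bits=8):
    """Interleave the bits of non-negative integer coordinates
    into a single Morton code (z-value)."""
    z = 0
    for bit in range(bits):
        for dim, c in enumerate(coords):
            # Place bit `bit` of coordinate `dim` at its interleaved position.
            z |= ((c >> bit) & 1) << (bit * len(coords) + dim)
    return z

print(z_value((3, 5)))  # interleaves binary 011 and 101 -> 39
```

Sorting points by z-value then reduces a multi-dimensional nearest-neighbor search to range scans over a one-dimensional order.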
More specifically, this algorithm becomes biased if the
distances between the initial data points are very scattered.
Another influential parameter is M, the number of hash functions
in each family. Since these are dependent on the dataset, experiments are