International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 07 | July -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 1807
Parallel kNN for Big Data using Adaptive Indexing
Tejal Katore¹, Prof. Dr. Suhasini Itkar²
¹Post Graduate Scholar, Dept. of Computer Engineering, P.E.S Modern College of Engineering, Pune, India
²Professor, Dept. of Computer Engineering, P.E.S Modern College of Engineering, Pune, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - k Nearest Neighbor (kNN) is frequently used in
classification methods; the kNN algorithm determines the class
membership of a given element. kNN does not perform well in the
context of large data, so multiple techniques were introduced to
execute kNN in parallel and enhance its performance. Along with
this, the MapReduce programming model was used, which is well
suited to distributed approaches. The reference algorithms that
compute kNN on MapReduce are H-zkNNJ, H-BNLJ, and RankReduce.
Data preprocessing, data partitioning, and computation are the
three common steps of kNN computation; the given solutions differ
only in their partitioning technique. Adaptive indexing is an
indexing paradigm in which index creation and reorganization take
place automatically and incrementally. It is used along with the
RankReduce algorithm and helps kNN execute more efficiently.
Key Words: Hadoop Block Nested Loop kNN (H-BNLJ),
Hadoop z value (H-zkNNJ), k Nearest Neighbor,
MapReduce, Performance Evaluation, RankReduce.
1. INTRODUCTION
k Nearest Neighbor is widely used as a classification or
clustering method in machine learning and data mining [1].
The k-Nearest Neighbor algorithm (kNN) [2] is considered
one of the ten most significant data mining algorithms. It is
a lazy learner that does not need an explicit training phase.
The method requires that all data instances be stored; unseen
cases are classified by finding the class labels of the k
instances closest to them [3]. To determine how close two
instances are, several distance measures can be computed. This
operation has to be performed for all the input examples
against the whole training dataset.
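The classification procedure described above can be sketched as follows; this is a minimal illustration of the general kNN method with Euclidean distance, using our own function names and toy data, not an implementation from any of the cited works:

```python
from collections import Counter
import math

def knn_classify(query, training_set, k=3):
    """Classify `query` by majority vote among its k nearest
    training instances, using Euclidean distance."""
    # Compute the distance from the query to every training instance.
    distances = []
    for features, label in training_set:
        distances.append((math.dist(query, features), label))
    # Keep the k closest instances and take a majority vote on labels.
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy data: (features, class label)
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(knn_classify((1.1, 0.9), training, k=3))  # -> "A"
```

Note that every query scans the full training set, which is exactly the cost that motivates the indexing and parallelization techniques discussed below.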
Given a set R of query points and a set S of reference points, a k
nearest neighbor join is an operation which, for each point in
R, discovers the k nearest neighbors in S. The data points are
divided into a training set and a testing set, also called unlabeled
data. The aim is to find the class label for the new points. For
each unlabeled point, a kNN query on the training set is
performed to estimate its class membership. This process
can be considered a kNN join of the testing set with the
training set. The basic idea to compute a kNN join is to
perform a pairwise distance computation for each element
in R and each element in S. The difficulties mainly lie in the
following two aspects: (1) data volume and (2) data
dimensionality. A lot of work has been dedicated to reducing
the in-memory computational complexity [1]. These works
mainly focus on two points: (1) using indexes to decrease the
number of distances that need to be calculated; such indexes can
hardly be scaled to high-dimensional data; (2) using projections
to reduce the dimensionality of the data, where maintaining
accuracy becomes another issue. Despite these efforts,
there are still significant limitations to processing kNN on a
centralized machine when the amount of data increases
[4],[10],[11].
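The pairwise idea behind a kNN join can be sketched as a brute-force computation; the function name and toy data below are our own illustration of the general operation, not code from the cited solutions:

```python
import math
import heapq

def knn_join(R, S, k):
    """Brute-force kNN join: for each query point in R, return the
    k nearest reference points in S via pairwise distances."""
    result = {}
    for r in R:
        # Keep the k reference points with the smallest distance to r.
        result[r] = heapq.nsmallest(k, S, key=lambda s: math.dist(r, s))
    return result

R = [(0.0, 0.0), (4.0, 4.0)]
S = [(0.0, 1.0), (1.0, 1.0), (3.0, 3.0), (5.0, 5.0)]
print(knn_join(R, S, k=2))
```

The cost is O(|R| x |S|) distance computations, which is what makes both the data volume and the dimensionality problematic at scale.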
Only distributed and parallel solutions have proven powerful
for large datasets. MapReduce is a flexible and scalable
parallel and distributed programming paradigm specially
designed for data-intensive processing, aiming at efficiently
processing large datasets. Writing a MapReduce job consists
of: (1) representing the data as key-value pairs, (2) defining a
map function, and (3) defining a reduce function. Here we
introduce the reference algorithms that compute kNN over
MapReduce. These algorithms are based on different methods,
but follow a common workflow which consists of three ordered
steps: (1) data pre-processing, (2) data partitioning, and (3) kNN
computation.
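The three parts of a MapReduce job can be illustrated with a small in-process simulation of kNN computation; the function names, the trivial single-partition strategy, and the toy data are our own sketch, not taken from Hadoop or from any of the reference algorithms:

```python
from collections import defaultdict
import math
import heapq

def map_phase(records):
    """Map: emit each record as a key-value pair, keyed by partition.
    A real partitioning strategy would spread keys; here we use one."""
    for kind, point in records:      # kind is "R" (query) or "S" (reference)
        yield 0, (kind, point)

def shuffle(pairs):
    """Group map outputs by key, as the MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values, k):
    """Reduce: compute the kNN of each query point within its partition."""
    queries = [p for kind, p in values if kind == "R"]
    refs = [p for kind, p in values if kind == "S"]
    for q in queries:
        yield q, heapq.nsmallest(k, refs, key=lambda s: math.dist(q, s))

records = [("R", (0.0, 0.0)), ("S", (0.0, 1.0)),
           ("S", (2.0, 2.0)), ("S", (5.0, 5.0))]
for key, values in shuffle(map_phase(records)).items():
    for q, neighbours in reduce_phase(key, values, k=2):
        print(q, "->", neighbours)
```

The reference algorithms differ precisely in how the map phase assigns partition keys, since that determines how much work each reducer must do.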
2. LITERATURE REVIEW
kNN is based on a distance function that measures the
difference or similarity between two instances. kNN with a
centralized approach was not able to cope with large
inputs, so new approaches to execute it in parallel were
developed. The various existing solutions that perform
the kNN operation in the context of MapReduce are given below.
The approach H-BNLJ [1] consists of two phases. The data set
is divided into blocks of a particular size. The data is
partitioned such that an element in a partition of R has
its nearest neighbors in only one partition of S. Two
partitioning strategies are proposed that separate the datasets
into independent partitions while preserving locality
information. H-zkNNJ [1],[4], which uses size-based
partitioning strategies, has very good load balance,
with a very small deviation in the completion time of each
task. In H-zkNNJ, the z-value transformation leads to
information loss. The recall of this algorithm is influenced by
the nature, the dimension, and the size of the input data.
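The z-value transformation maps a multi-dimensional point to a one-dimensional value by interleaving the bits of its coordinates (a Morton code); nearby points usually, but not always, receive nearby z-values, which is the source of the information loss. A minimal sketch of the interleaving, as our own illustration rather than the H-zkNNJ implementation:

```python
def z_value(coords, bits=8):
    """Interleave the bits of non-negative integer coordinates
    into a single Morton code (z-value)."""
    z = 0
    for bit in range(bits):
        for dim, c in enumerate(coords):
            # Place bit `bit` of coordinate `dim` at its interleaved position.
            z |= ((c >> bit) & 1) << (bit * len(coords) + dim)
    return z

print(z_value((3, 5)))  # interleaves binary 011 and 101 -> 39
```

Sorting points by z-value then reduces a multi-dimensional nearest-neighbor search to range scans over a one-dimensional order.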
More specifically, this algorithm becomes biased if the
distances between the initial data points are very scattered.
Another influential parameter is M, the number of hash functions
in each family. Since these are dependent on the dataset, experiments are