Efficient Multi-dimensional Spatial RkNN Query Processing with MapReduce

Changqing Ji 1,2, Hongbin Hu 3, Yujie Xu 1, Yuanyuan Li 1, Wenyu Qu 1
1. College of Information Science and Technology, Dalian Maritime University, Dalian, 116026, China
2. College of Physical Science and Technology, Dalian University, Dalian, 116622, China
3. Inner Mongolia Electric Power Research Institute, Huhot, 010020, China
E-mail: {jcqgood, yujiex.dlmu, lyy3232312, eunice.qu}@gmail.com

Abstract: Reverse k Nearest Neighbor (RkNN) queries are of particular interest in a wide range of data mining applications, such as decision support systems, profile-based marketing, and spatial databases. With the increasing volume of spatial data, it is difficult to perform RkNN queries efficiently because of limited computational capability and storage resources. In this paper, we investigate how to perform distributed RkNN queries using MapReduce. First, we investigate the Basic-MRRkNN query method, which is based on an inverted grid index over large-scale spatial datasets. Second, we propose an optimization, the Lazy-MRRkNN query algorithm, which prunes the search space when all data points are discovered. To the best of our knowledge, this is the first work to propose exact RkNN processing algorithms using MapReduce on multi-dimensional datasets. Extensive experiments using both real and synthetic datasets demonstrate that our proposed methods are efficient and scalable.

Keywords: Reverse Nearest Neighbors; MapReduce; Spatial Databases.

I. INTRODUCTION

In recent years, smart phones and tablets have increasingly carried sensors such as GPS, cameras, and Bluetooth. The proliferation of mobile sensors and location-based service (LBS) applications has produced large-scale spatial data collections that contain billions of spatial coordinates. Indexing and querying such large volumes of multi-dimensional spatial data are major challenges.
A Reverse k Nearest Neighbor (RkNN) query retrieves the objects in the queried multi-dimensional dataset whose k nearest neighbors (kNN) include the query point q. It is a typical spatial query that has drawn considerable attention in recent years, and it is widely used in intelligent navigation, modern communications, traffic control, profile-based marketing, spatial clustering, and other areas. As an example of profile-based marketing with RkNN, suppose a restaurant has a marketing application that determines the business impact of restaurants on one another in the centre of New York.

RkNN queries have been studied quite extensively, but existing works (such as TPL [1] and Voronoi [2]) are based on the centralized paradigm and run on a single centralized server. Because of the limited computational capability and storage of a single machine, such a system eventually suffers from performance deterioration as the size of the dataset increases, especially for multi-dimensional datasets. Motivated by these shortcomings of traditional methods, we present a new exact RkNN algorithm and an optimized variant for large-scale, multi-dimensional spatial datasets.

MapReduce, proposed by Google [3], is a very popular big data processing model that has rapidly been studied and applied in both industry and academia. It has emerged as one of the most widely used parallel computing platforms for processing data at terabyte and petabyte scales. Moreover, it allows easy parallelization of data-intensive computations over many machines. In our previous work on big data processing [4], we divided existing MapReduce applications into three categories: partitioning sub-space, decomposing sub-processes, and approximate overlapping calculations. This paper belongs to the second category. Most existing works rely on a centralized index structure, such as a tree-based index [5], which cannot be used in distributed and parallel environments.
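To make the RkNN definition given earlier in this section concrete, the following is a minimal brute-force sketch in Python. It is not the Basic-MRRkNN or Lazy-MRRkNN algorithm and involves no MapReduce or indexing; it only serves as a single-machine reference for what an exact RkNN query should return. The function name brute_force_rknn, the sample points, and the rule that distance ties are resolved in favour of q are assumptions made for illustration.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def brute_force_rknn(points, q, k):
    """Return every data point p that has q among its k nearest neighbours.

    The neighbours of p are taken from the other data points plus q.
    A point p qualifies when fewer than k other data points are strictly
    closer to p than q is (ties are resolved in favour of q; assumption).
    This is an O(n^2) reference, usable in any number of dimensions.
    """
    result = []
    for i, p in enumerate(points):
        d_pq = dist(p, q)
        # Count the data points, other than p itself, that beat q.
        closer = sum(
            1
            for j, other in enumerate(points)
            if j != i and dist(p, other) < d_pq
        )
        if closer < k:
            result.append(p)
    return result


# Hypothetical sample data: two points near q and two far away.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
q = (0.5, 0.0)
print(brute_force_rknn(points, q, 1))  # [(0.0, 0.0), (1.0, 0.0)]
```

With k = 1, only the two points next to q report q as their nearest neighbour; the two distant points are each other's nearest neighbour and are excluded. The quadratic cost of this check on a single machine is the kind of limitation that motivates a distributed approach.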
In [6], the authors adopt a Voronoi-diagram-based partitioning approach and apply MapReduce to answer RkNN and other queries. However, the Voronoi diagram is limited to 2D space, and locating a point in a Voronoi-based index takes extra time. The state-of-the-art RkNN method is TPL [7], developed by Yufei Tao et al., which uses half-space pruning for exact RkNN processing with arbitrary values of k on dynamic, multi-dimensional datasets. However, its hierarchical R-tree indices do not scale because of the traditional top-down search, and their complex structure and inherently sequential nature make them difficult to parallelize. Consequently, TPL can only run on a single machine and does not scale well. Efficient distributed RkNN processing therefore remains an open problem. To the best of our knowledge, this is the first work that proposes a spatial RkNN query based on MapReduce.

The contributions of this paper can be summarized as follows:
(1) Simplicity. We use a simple, distributed inverted grid index over large-scale datasets. We also design a basic method for processing reverse k nearest neighbor queries.
(2) Scalability. We present new decoupled methods; with our methods, the filter and refinement steps become two
2013 8th Annual ChinaGrid Conference 978-0-7695-5058-9/13 $26.00 © 2013 IEEE DOI 10.1109/ChinaGrid.2013.17