Efficient Multi-dimensional
Spatial RkNN Query Processing with MapReduce
Changqing Ji (1,2), Hongbin Hu (3), Yujie Xu (1), Yuanyuan Li (1), Wenyu Qu (1)
1. College of Information Science and Technology, Dalian Maritime University, Dalian, 116026, China
2. College of Physical Science and Technology, Dalian University, Dalian, 116622, China
3. Inner Mongolia Electric Power Research Institute, Hohhot, 010020, China
E-mail: {jcqgood, yujiex.dlmu, lyy3232312, eunice.qu}@gmail.com
Abstract—Reverse k Nearest Neighbor (RkNN) queries are of
particular interest in a wide range of data mining applications
such as decision support systems, profile-based marketing, and
spatial databases. With the increasing volume of spatial data,
it is difficult to perform RkNN queries efficiently because of
limited computational capability and storage resources. In
this paper, we investigate how to perform distributed RkNN
queries using MapReduce. First, we investigate the
Basic-MRRkNN query method, which is based on an inverted grid
index over large-scale spatial datasets. Second, we propose an
optimized method, the Lazy-MRRkNN query algorithm, which
prunes the search space when all data points are discovered.
To the best of our knowledge, this is the first work to
propose exact RkNN processing algorithms using MapReduce
on multi-dimensional datasets. Extensive experiments using
both real and synthetic datasets demonstrate that our
proposed methods are efficient and scalable.
Keywords- Reverse Nearest Neighbors; MapReduce;
Spatial Databases.
I. INTRODUCTION
In recent years, smartphones and tablets have increasingly
come equipped with sensors such as GPS, cameras, and
Bluetooth. The proliferation of mobile sensors and LBS
applications has made large-scale spatial data collections
containing billions of spatial coordinates a reality. Indexing
and querying such large volumes of multi-dimensional
spatial data are major challenges.
An RkNN (Reverse k Nearest Neighbor) query retrieves the
objects in the queried multi-dimensional dataset whose k
nearest neighbors (kNN) include the query point q. It is a
typical spatial query that has drawn a lot of attention in
recent years, and it is popular in intelligent navigation,
modern communications, traffic control, profile-based
marketing, spatial clustering, and other areas. As an example
of profile-based marketing using RkNN, suppose a restaurant
has a marketing application that determines the business
impact of restaurants on each other in the centre of New
York.
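The RkNN definition above can be illustrated with a brute-force sketch. This is not one of the algorithms proposed in this paper; it is a quadratic-time baseline for small in-memory datasets. The function name `rknn_brute_force` is hypothetical, and the sketch assumes Euclidean distance and that the query point q is not itself a member of the dataset.

```python
from math import dist  # Euclidean distance, Python 3.8+


def rknn_brute_force(points, q, k):
    """Return the points whose k nearest neighbours include q.

    A point p qualifies when fewer than k other dataset points
    are strictly closer to p than the query point q is.
    """
    result = []
    for i, p in enumerate(points):
        d_q = dist(p, q)
        closer = sum(
            1
            for j, o in enumerate(points)
            if j != i and dist(p, o) < d_q
        )
        if closer < k:
            result.append(p)
    return result


restaurants = [(0, 0), (1, 0), (5, 0), (6, 0)]
query = (2, 0)
print(rknn_brute_force(restaurants, query, k=1))
```

In the restaurant example, the returned points would be the existing restaurants for which a new restaurant at the query location is among their k closest competitors. The cost grows quadratically with the dataset size, which is why such a baseline does not scale on a single machine.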
RkNN queries have been studied quite extensively, but
existing works (such as TPL [1] and Voronoi [2]) are
based on the centralized paradigm and run on a
single centralized server. Because of the limited
computational capability and storage of a single machine,
such a system will eventually suffer from performance
deterioration as the size of the dataset increases, especially
for multi-dimensional datasets. Motivated by these
shortcomings of traditional methods, we present a new exact
RkNN algorithm and an optimized variant for large-scale,
multi-dimensional spatial datasets.
MapReduce, proposed by Google [3], is a very popular
big data processing model that has rapidly been studied and
applied in both industry and academia. It has
emerged as one of the most widely used parallel computing
platforms for processing data on terabyte and petabyte
scales. Moreover, it allows for easy parallelization of
data-intensive computations over many machines. In our
previous work on big data processing [4], we divided
existing MapReduce applications into three categories:
partitioning sub-space, decomposing sub-processes, and
approximate overlapping calculations. This paper belongs to
the second category.
Most existing works rely on a centralized
indexing structure, such as a tree-based index [5], which
cannot be used in distributed and parallel environments. The
authors of [6] adopt a Voronoi diagram partitioning based
approach and simply apply MapReduce to answer RkNN
and other queries. However, Voronoi is limited to 2D space,
and locating a point in a Voronoi-based index takes
extra time. The state-of-the-art RkNN method is TPL [7],
developed by Yufei Tao et al. They used half-space pruning
for exact RkNN processing with arbitrary values of k on
dynamic, multi-dimensional datasets. However, the
hierarchical R-tree indices do not scale because of the traditional
top-down search, and their complex structure and inherently
sequential nature make them difficult to parallelize.
Unfortunately, the TPL method can therefore only run on a single
machine and does not scale well. Thus, an efficient distributed
RkNN method is still an open problem.
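The half-space pruning idea mentioned above can be sketched at the level of individual points. This is a simplified illustration under stated assumptions, not the TPL algorithm itself, which operates on R-tree nodes. The helper names `in_pruned_half_space` and `survives_pruning` are hypothetical, and Euclidean distance is assumed. A point p that lies on a candidate c's side of the perpendicular bisector between q and c is closer to c than to q; if at least k candidates prune p in this way, q cannot be among the k nearest neighbours of p.

```python
from math import dist  # Euclidean distance, Python 3.8+


def in_pruned_half_space(p, q, c):
    """True if p is strictly closer to candidate c than to query q,
    i.e. p lies on c's side of the perpendicular bisector of q and c."""
    return dist(p, c) < dist(p, q)


def survives_pruning(p, q, candidates, k):
    """True if fewer than k candidates (other than p) prune p,
    so p may still be a reverse k nearest neighbour of q."""
    hits = sum(
        1
        for c in candidates
        if c != p and in_pruned_half_space(p, q, c)
    )
    return hits < k


q = (0, 0)
candidates = [(4, 0)]
print(survives_pruning((3, 0), q, candidates, k=1))
print(survives_pruning((1, 0), q, candidates, k=1))
```

Points that survive this filter are only candidates; a refinement step is still needed to confirm that q is really among their k nearest neighbours.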
To the best of our knowledge, this is the first work that
proposes a spatial RkNN query based on MapReduce.
The contributions of this paper can be summarized as
follows:
(1) Simplicity. We use the simple and distributed inverted
grid index over large-scale datasets. We also design a basic
method for processing reverse k nearest neighbors.
(2) Scalability. We present new decoupled methods. With
these methods, the filter and refinement steps become two
2013 8th Annual ChinaGrid Conference
978-0-7695-5058-9/13 $26.00 © 2013 IEEE
DOI 10.1109/ChinaGrid.2013.17