Scalable Spatio-temporal Indexing and Querying over a Document-oriented NoSQL Store

Nikolaos Koutroumanis
Department of Digital Systems
University of Piraeus
Piraeus, Greece
koutroumanis@unipi.gr

Christos Doulkeridis
Department of Digital Systems
University of Piraeus
Piraeus, Greece
cdoulk@unipi.gr

ABSTRACT
In this paper, we provide an in-depth study of the performance of spatio-temporal queries in document-oriented NoSQL stores. Existing NoSQL stores provide limited support for spatial data and (quite often) no native support for spatio-temporal data. As a result, the performance of query execution over large collections of spatio-temporal data is often suboptimal. We present an approach for indexing spatio-temporal data, which is applicable to any NoSQL store that provides key-based access to data without modifications to its code, and we show how to generate data partitions that preserve data locality. Moreover, we show the impact of indexing and partitioning on the number of cluster nodes that serve a query, and we discuss the advantages and disadvantages for different applications. We adopt a methodology for the evaluation of spatio-temporal range queries, which can serve as a benchmark. In our experiments, we focus on MongoDB (as a representative NoSQL store that provides spatial support) and we study the impact of indexing spatio-temporal data on performance, using both real-life and synthetic data sets in a medium-sized cluster.

1 INTRODUCTION
Big spatio-temporal data sets are collected every day at unprecedented rates [15, 17], due to emergent applications, such as fleet management solutions, maritime and aviation surveillance systems, human and animal tracking, IoT sensor feeds, location-based web search, and social networks with geotagged content.
These applications generate huge volumes of positional information represented as points, which require scalable storage and retrieval, so that data analysis techniques can be applied to discover hidden spatio-temporal patterns. As a result, scalable spatio-temporal data management is a challenging research topic, and efficient solutions are required for storage, indexing and querying.

NoSQL stores [4, 7] comprise the state of the art in scalable storage to date. However, while an increasing number of NoSQL stores have recently added support for spatial data, this is seldom the case for spatio-temporal data. In fact, even spatial data access methods are not always optimized in today’s mainstream NoSQL stores. While most relational DBMSs have adopted R-trees [11] (or their variants [2, 16]) for efficient spatial indexing, NoSQL stores with spatial support adopt GeoHashes to map spatial data to one-dimensional (1D) values, which are then indexed using traditional 1D indexes, such as B-trees [6] (see Table 1). Our conjecture is that this decision relates to the cost of building and maintaining a distributed R-tree. Consequently, the performance of existing solutions is suboptimal when faced with the challenge of efficient and scalable retrieval of spatio-temporal data.

Type    Database                        Spatial Indexing
RDBMS   PostgreSQL (PostGIS extension)  R-tree
RDBMS   MySQL                           R-tree
RDBMS   Oracle                          R-tree
RDBMS   MariaDB                         R-tree
RDBMS   SQL Server                      B-tree
RDBMS   SQLite (SpatiaLite extension)   B-tree
NoSQL   MongoDB                         B-tree
NoSQL   Redis (Geo API)                 Sorted Set
NoSQL   DynamoDB                        B-tree
NoSQL   Elasticsearch                   BKD-tree
NoSQL   Neo4J                           B+Tree

Table 1: Spatial support in the most popular relational and NoSQL data stores

© 2021 Copyright held by the owner/author(s). Published in Proceedings of the 24th International Conference on Extending Database Technology (EDBT), March 23-26, 2021, ISBN 978-3-89318-084-4 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.
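The GeoHash-plus-1D-index idea described above can be illustrated with a minimal Python sketch. It assumes the standard base-32 GeoHash encoding and uses a sorted list with binary search as a stand-in for a B-tree; the function names `geohash_encode` and `prefix_scan` are hypothetical, and the code is not meant to reflect the internal implementation of MongoDB or of any other store listed in Table 1.

```python
from bisect import bisect_left

# Standard GeoHash base-32 alphabet (digits plus letters, without a, i, l, o).
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=7):
    """Map a (lat, lon) point to a 1D string key by interleaving bits.

    Bits alternate between longitude and latitude, starting with
    longitude; every 5 bits are emitted as one base-32 character.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def prefix_scan(sorted_keys, prefix):
    """Return all keys that start with `prefix` from a sorted key list.

    The sorted list plays the role of a 1D index (hypothetical stand-in
    for a B-tree): a spatial cell lookup becomes a contiguous key range.
    """
    lo = bisect_left(sorted_keys, prefix)
    hi = bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]
```

For example, `geohash_encode(42.6, -5.6, 5)` yields `"ezs42"`, and all points that fall in the same cell share that prefix, so a cell lookup reduces to one range scan over the sorted keys. A rectangular query generally overlaps several cells and therefore several key ranges, which is one reason such 1D mappings can be less efficient than a native multidimensional index.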
Our work is motivated by real-life applications, revolving around fleet management operators in the urban domain, which collect large volumes of positional data from GPS-equipped vehicles daily. The specific use-cases that are supported by our work relate to exploratory analysis of historical routes, using multiple spatio-temporal queries of varying granularity. The retrieved trajectories are analyzed for fleet cost reduction (by analyzing the fuel consumption of historical routes), intelligent routing, as well as for discovering movement patterns. The challenge is to provide a scalable storage and spatio-temporal querying solution for large volumes of historical mobility data. Unfortunately, existing industrial solutions are not optimized for spatio-temporal querying at scale, thus fleet management operators apply data analysis techniques only on recent subsets of their historical database, while older data is kept in cold storage.

Motivated by these limitations, in this application paper, we provide an in-depth study of querying spatio-temporal data at scale, focusing on a document-oriented NoSQL store, namely MongoDB. The choice of MongoDB is justified by its wide popularity among big data developers, and its maturity compared to other competing technologies. We explain the internal details of indexing and sharding, focusing on how spatial data is supported, and eventually design a solution for spatio-temporal data using the built-in indexes of MongoDB. Then, we propose an alternative approach that uses the Hilbert space-filling curve (which has been shown to have good clustering properties [14]) to generate one-dimensional (1D) keys, which facilitates indexing of spatio-temporal data and makes it possible to preserve data locality in the nodes of the MongoDB cluster. Moreover, this approach can be implemented on top of MongoDB (and other key-based NoSQL stores), thus being directly applicable to any application.
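To make the idea of Hilbert-based 1D keys concrete, the following Python sketch maps a longitude/latitude pair to a position on a 2D Hilbert curve, using the classic iterative cell-to-index conversion. It is only an illustration under stated assumptions: the function names `hilbert_index` and `point_to_key`, the uniform grid over the whole globe, and the default curve order are hypothetical, and the temporal dimension is left out because this section does not specify how it is folded into the key.

```python
def hilbert_index(order, x, y):
    """Return the position of grid cell (x, y) along a 2D Hilbert curve.

    The grid has 2**order cells per side; x and y must lie in
    [0, 2**order - 1]. Classic iterative quadrant-by-quadrant conversion.
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def point_to_key(lon, lat, order=16):
    """Quantize (lon, lat) to a grid cell and return its Hilbert key.

    Assumption: the grid covers the whole globe uniformly; an actual
    deployment would more likely use the bounding box of the data set.
    """
    n = 1 << order
    x = min(n - 1, int((lon + 180.0) / 360.0 * n))
    y = min(n - 1, int((lat + 90.0) / 180.0 * n))
    return hilbert_index(order, x, y)
```

Consecutive key values always correspond to neighbouring grid cells, so range-partitioning a collection on such a key tends to place spatially close points on the same node, which is the locality property the approach relies on.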
Industrial Paper Series, ISSN: 2367-2005, DOI: 10.5441/002/edbt.2021.71