Cruncher: Distributed In-Memory Processing for Location-Based Services

Ahmed S. Abdelhamid*, Mingjie Tang*, Ahmed M. Aly*, Ahmed R. Mahmood*, Thamir Qadah*, Walid G. Aref*, Saleh Basalamah†

*Purdue University, West Lafayette, IN, USA
†Umm Al-Qura University, Makkah, KSA

Abstract—Advances in location-based services (LBS) demand high-throughput processing of both static and streaming data. Recently, many systems have been introduced to support distributed main-memory processing to maximize the query throughput. However, these systems are not optimized for spatial data processing. In this demonstration, we showcase Cruncher, a distributed main-memory spatial data warehouse and streaming system. Cruncher extends Spark with adaptive query processing techniques for spatial data. Cruncher uses dynamic batch processing to distribute the queries and the data streams over commodity hardware according to an adaptive partitioning scheme. The batching technique also groups and orders the overlapping spatial queries to enable inter-query optimization. Both the data streams and the offline data share the same partitioning strategy, which allows for data co-locality optimization. Furthermore, Cruncher uses an adaptive caching strategy to maintain the frequently-used location data in main memory. Cruncher maintains operational statistics to optimize query processing, data partitioning, and caching at runtime. We demonstrate two LBS applications over Cruncher using real datasets from OpenStreetMap and two synthetic data streams. We demonstrate that Cruncher achieves order(s) of magnitude throughput improvement over Spark when processing spatial data.

I. INTRODUCTION

The popularity of location-based services (LBS, for short) has resulted in an unprecedented increase in the volume of spatial information.
In addition to the location attributes (e.g., longitude and latitude), the created data may include a temporal component (e.g., a timestamp) and other application-driven attributes (e.g., check-in data, identifiers of moving objects, and associated textual content) [1]. Applications span a wide range of services, e.g., tracking moving objects, location-based advertisement, and online gaming. Although these LBSs vary according to the nature of the underlying application, they share the need for high-throughput processing, low latency, adaptivity to changes in the location data distribution over time, and efficient utilization of computing resources. This calls for efficient processing of high-rate spatial data streams as well as of huge amounts of static spatial data, e.g., OpenStreetMap. Moreover, the worldwide use of LBS applications requires processing spatial queries at an unprecedented scale. For instance, LBSs are required to maintain information for tens, if not hundreds, of millions of users in addition to huge amounts of other service-associated data (e.g., maps and road networks), while processing millions of user requests and data updates per second.

Cloud computing platforms, where hardware cost is associated with usage rather than ownership, call for enhancing query processing and storage efficiency. Furthermore, the dynamic nature of location data, especially spatial data streams and workloads, renders the conventional optimize-then-execute model inefficient and calls for adaptive query processing techniques, where statistics are collected to fine-tune the query processing and storage at runtime (e.g., see [2]).

One aspect that distinguishes LBSs is query complexity. In contrast to enterprise data applications, LBS queries are more sophisticated and can involve combinations of spatial, temporal, and relational operators, e.g., see [3], [4]. Some of these operators are expensive, e.g., k-nearest-neighbor (kNN) [5].
To address these challenges, various parallel and distributed systems have been customized to handle location data, e.g., MD-HBase [1], HadoopGIS [6], SpatialHadoop [7], Parallel Secondo [8], and Tornado [9]. These systems share a common goal: to store and query big spatial data over shared-nothing commodity machines. However, they suffer from disk bottlenecks and provide no provisions for adaptive query processing.

Recently, the significant drop in main-memory cost has initiated a wave of distributed main-memory processing systems. Spark [10] exemplifies this computing paradigm. Spark provides a shared-memory abstraction using Resilient Distributed Datasets (RDDs, for short) [11]. RDDs are immutable and support only coarse-grained operations (referred to as transformations). RDDs keep the history of transformations (referred to as the lineage) for fault tolerance. RDDs are lazily evaluated and ephemeral: an RDD transformation is computed only upon data access (via operations referred to as actions), and data is kept in memory only upon deliberate request. In addition, Spark supports near-real-time data stream processing through small batches represented as RDDs (referred to as Discretized Streams) [12]. However, Spark is not optimized for spatial data processing and makes no assumptions about the underlying data or query types.

This demonstration presents Cruncher, a distributed spatial data warehouse and streaming system. Cruncher provides high-throughput processing of online and offline spatial data. Cruncher extends Spark with adaptive query processing techniques. Natively, Spark processes data stream records in order of arrival. However, processing a batch of data elements or queries offers an opportunity for optimization and renders a fixed batch-content ordering sub-optimal. Hence, Cruncher introduces a new batching technique, where the system dynamically changes the batch-content ordering to update the RDDs efficiently.
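The lazy-evaluation and lineage ideas described above can be illustrated with a toy sketch. Note that this is plain Python of our own, not Spark's actual API; the class and method names are illustrative only:

```python
# Toy illustration of RDD-style semantics: transformations are recorded
# lazily in a lineage chain, and nothing is computed until an action runs.

class ToyRDD:
    def __init__(self, data=None, parent=None, op=None):
        self.data = data      # only the root node holds actual data
        self.parent = parent  # lineage: link back to the parent dataset
        self.op = op          # the recorded transformation, if any

    # Transformations: return a new dataset immediately, compute nothing.
    def map(self, f):
        return ToyRDD(parent=self, op=("map", f))

    def filter(self, pred):
        return ToyRDD(parent=self, op=("filter", pred))

    def lineage(self):
        """Walk back through the recorded transformations (fault-tolerance
        in Spark relies on replaying exactly this kind of history)."""
        node, chain = self, []
        while node.op is not None:
            chain.append(node.op[0])
            node = node.parent
        return list(reversed(chain))

    # Action: only here is the lineage actually evaluated.
    def collect(self):
        if self.op is None:
            return list(self.data)
        parent_data = self.parent.collect()
        kind, f = self.op
        if kind == "map":
            return [f(x) for x in parent_data]
        return [x for x in parent_data if f(x)]

rdd = ToyRDD(data=[1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.lineage())  # ['map', 'filter'] -- recorded, but not yet computed
print(rdd.collect())  # [20, 30, 40]     -- evaluation happens at the action
```

The point of the sketch is the separation: building the chain is cheap and side-effect free, while `collect()` triggers the whole recomputation from the root, mirroring why Spark can recover lost partitions by replaying lineage.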
In addition, processing a batch of multiple queries offers an opportunity for multi-query optimization, and hence Cruncher introduces an inter-query optimization technique for range and kNN queries. Furthermore, Spark speeds up data processing by partitioning the data in main memory.
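To make the inter-query optimization idea concrete, consider a simplified one-dimensional sketch (ours, not Cruncher's actual algorithm): a batch of overlapping range queries can share a single sort of the data, after which each query is answered by two binary searches instead of a full scan per query:

```python
# Hypothetical sketch of batching range queries: sorting the data once is
# amortized over the whole batch, and each range is then answered with
# binary search rather than an independent scan.
import bisect

def answer_range_batch(points, queries):
    """points: list of numbers; queries: list of (lo, hi) inclusive ranges.
    Returns one sorted result list per query, from a single shared sort."""
    pts = sorted(points)  # shared work, done once for the whole batch
    results = []
    for lo, hi in queries:
        i = bisect.bisect_left(pts, lo)   # first point >= lo
        j = bisect.bisect_right(pts, hi)  # one past the last point <= hi
        results.append(pts[i:j])
    return results

pts = [3, 9, 1, 7, 5]
print(answer_range_batch(pts, [(2, 6), (4, 10)]))  # [[3, 5], [5, 7, 9]]
```

Real spatial batching works over two-dimensional regions and partitioned data, but the underlying principle is the same: overlapping queries in a batch share work that per-query, arrival-order processing would repeat.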