External Data Access and Indexing in AsterixDB Abdullah Alamoudi, Raman Grover, Michael J. Carey, Vinayak Borkar Dept. of Computer Science, University of California Irvine, CA, USA - 92697 {alamouda, ramang, mjcarey, vborkar}@uci.edu ABSTRACT Traditional database systems offer rich query interfaces (SQL) and efficient query execution for data that they store. Re- cent years have seen the rise of Big Data analytics plat- forms offering query-based access to “raw” external data, e.g., file-resident data (often in HDFS). In this paper, we de- scribe techniques to achieve the qualities offered by DBMSs when accessing external data. This work has been built into Apache AsterixDB, an open source Big Data Management System. We describe how we build distributed indexes over external data, partition external indexes, provide query con- sistency across access paths, and manage external indexes amidst concurrent activities. We compare the performance of this new AsterixDB capability to an external-only solu- tion (Hive) and to its internally managed data and indexes. Categories and Subject Descriptors: H.2 [DATABASE MANAGEMENT]: Systems - Query Processing Keywords: AsterixDB, External data, HDFS, Access, In- dexing. 1. INTRODUCTION Database management systems employ many techniques to achieve good performance. These include storage struc- tures, access methods, caching, and efficient query execu- tion. As a prerequisite, a DBMS requires data to be stored into its storage layer and modified through its interface. Having to pre-load data can be a major obstacle, particu- larly when data is being produced in huge quantities by mul- tiple sources and being persisted in different storage systems and formats. Providing full-scan access to external data from a DBMS is a first step that enables the use of queries rather than error-prone ad hoc analysis scripts. However, the lack of indexes is bound to give unacceptable query re- sponse times. A natural next step is to support indexing for external data. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full cita- tion on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- publish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. CIKM’15, October 19–23, 2015, Melbourne, Australia. © 2015 ACM. ISBN 978-1-4503-3794-6/15/10 ...$15.00. DOI: http://dx.doi.org/10.1145/2806416.2806428. 1.1 External Access and Indexing Challenges Efficient and flexible external data access in a parallel data manager involves challenges stemming from lack of control over the representation of records, their storage locations, and the data modification path(s). These include: C1) Disparate Data Sources and Formats: An external ac- cess facility must be generic and extensible to allow access to data in different data sources and formats. C2) Seamless Integration: The distinction between external and internal data should not be visible to the end-user when framing queries in the language offered by the system. C3) Parallel/Efficient Access: A parallel data manager must exploit parallelism in accessing external data and utilize dis- tributed index structures to achieve load balancing. C4) Maintenance of Indexes: External data may change, causing indexes to become stale. The system must enable users to refresh the indexes with transactional behavior. C5) Consistency of Query Results: The system should offer consistent query results for external data regardless of the access path used and concurrent changes to external data. 1.2 Contributions We describe how we have recently enhanced AsterixDB to enable users to efficiently query externally stored data. Our solution supports a set of popular external data source types and formats but is extensible to cater to new data sources and formats. It allows users to build multiple distributed indexes (e.g., B+ Trees and/or R-Trees) over external data. The query compiler then utilizes these indexes to accelerate queries. We offer the following contributions: (1) External Data Access and Indexing: We provide a com- plete conceptual and technical design for providing access to and building distributed indexes over external data. (2) Access Semantics: We introduce external data snapshot semantics for queries accessing indexed external data. We describe mechanisms to keep indexes in known synchrony with the data and provide consistent query results. (3) Incremental Updates: We detail the design for refresh- ing a dataset’s indexes incrementally in AsterixDB in an efficient and robust manner. (4) Contribution to Open-Source: AsterixDB is open-source. Its support for adapters, parsers, and indexing is extensible to allow other project contributors to add new adapters and parsers for other external data sources and formats. (5) Experimental Evaluation: We measure the external data access performance for different HDFS formats, comparing AsterixDB’s performance when data resides externally in Hive tables [2] to that of Hive itself. We also compare per- 3