External Data Access and Indexing in AsterixDB
Abdullah Alamoudi, Raman Grover, Michael J. Carey, Vinayak Borkar
Dept. of Computer Science,
University of California Irvine, CA, USA - 92697
{alamouda, ramang, mjcarey, vborkar}@uci.edu
ABSTRACT
Traditional database systems offer rich query interfaces (SQL)
and efficient query execution for data that they store. Re-
cent years have seen the rise of Big Data analytics plat-
forms offering query-based access to “raw” external data,
e.g., file-resident data (often in HDFS). In this paper, we de-
scribe techniques to achieve the qualities offered by DBMSs
when accessing external data. This work has been built into
Apache AsterixDB, an open source Big Data Management
System. We describe how we build distributed indexes over
external data, partition external indexes, provide query con-
sistency across access paths, and manage external indexes
amidst concurrent activities. We compare the performance
of this new AsterixDB capability to an external-only solu-
tion (Hive) and to its internally managed data and indexes.
Categories and Subject Descriptors: H.2 [DATABASE
MANAGEMENT]: Systems - Query Processing
Keywords: AsterixDB, External data, HDFS, Access, In-
dexing.
1. INTRODUCTION
Database management systems employ many techniques
to achieve good performance. These include storage struc-
tures, access methods, caching, and efficient query execu-
tion. As a prerequisite, a DBMS requires data to be stored
into its storage layer and modified through its interface.
Having to pre-load data can be a major obstacle, particu-
larly when data is being produced in huge quantities by mul-
tiple sources and being persisted in different storage systems
and formats. Providing full-scan access to external data
from a DBMS is a first step that enables the use of queries
rather than error-prone ad hoc analysis scripts. However,
the lack of indexes is bound to give unacceptable query re-
sponse times. A natural next step is to support indexing for
external data.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full cita-
tion on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-
publish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from Permissions@acm.org.
CIKM’15, October 19–23, 2015, Melbourne, Australia.
© 2015 ACM. ISBN 978-1-4503-3794-6/15/10 ...$15.00.
DOI: http://dx.doi.org/10.1145/2806416.2806428.
1.1 External Access and Indexing Challenges
Efficient and flexible external data access in a parallel data
manager involves challenges stemming from lack of control
over the representation of records, their storage locations,
and the data modification path(s). These include:
C1) Disparate Data Sources and Formats: An external ac-
cess facility must be generic and extensible to allow access
to data in different data sources and formats.
C2) Seamless Integration: The distinction between external
and internal data should not be visible to the end-user when
framing queries in the language offered by the system.
C3) Parallel/Efficient Access: A parallel data manager must
exploit parallelism in accessing external data and utilize dis-
tributed index structures to achieve load balancing.
C4) Maintenance of Indexes: External data may change,
causing indexes to become stale. The system must enable
users to refresh the indexes with transactional behavior.
C5) Consistency of Query Results: The system should offer
consistent query results for external data regardless of the
access path used and concurrent changes to external data.
1.2 Contributions
We describe how we have recently enhanced AsterixDB to
enable users to efficiently query externally stored data. Our
solution supports a set of popular external data source types
and formats but is extensible to cater to new data sources
and formats. It allows users to build multiple distributed
indexes (e.g., B+ Trees and/or R-Trees) over external data.
The query compiler then utilizes these indexes to accelerate
queries. We offer the following contributions:
(1) External Data Access and Indexing: We provide a com-
plete conceptual and technical design for providing access to
and building distributed indexes over external data.
(2) Access Semantics: We introduce external data snapshot
semantics for queries accessing indexed external data. We
describe mechanisms to keep indexes in known synchrony
with the data and provide consistent query results.
(3) Incremental Updates: We detail the design for refresh-
ing a dataset’s indexes incrementally in AsterixDB in an
efficient and robust manner.
(4) Contribution to Open-Source: AsterixDB is open-source.
Its support for adapters, parsers, and indexing is extensible
to allow other project contributors to add new adapters and
parsers for other external data sources and formats.
(5) Experimental Evaluation: We measure the external data
access performance for different HDFS formats, comparing
AsterixDB’s performance when data resides externally in
Hive tables [2] to that of Hive itself. We also compare per-
3