SAMUEL: A Sharing-based Approach to processing Multiple
SPARQL Queries with MapReduce∗
InA Kim
Dep’t of Computer Engineering
Chungnam National University
Daejeon, 34134, Korea
dodary0214@gmail.com
Kyong-Ha Lee
2
Korea Institute of Science and
Technology Information
Daejeon, 34141, Korea
bart7449@gmail.com
Kyu-Chul Lee
Dep’t of Computer Engineering
Chungnam National University
Daejeon, 34134, Korea
kclee@cnu.ac.kr
ABSTRACT
The volume of RDF data is now growing tremendously. It is
thus considered prudent to store and process massive RDF data
with distributed SPARQL engines instead of relying on a single-
machine system. Many sophisticated index and partitioning schemes
have also been proposed to support SPARQL query evaluations.
However, existing SPARQL engines have mainly followed a
one-at-a-time scheme, in which each query is processed
separately. We showcase SAMUEL, a distributed SPARQL engine
that simultaneously evaluates many SPARQL queries over a
massive RDF dataset with MapReduce. SAMUEL provides an
efficient optimization algorithm to evaluate many SPARQL
queries simultaneously in a shared and balanced way. Extensive
experiments show that, without any sophisticated partitioning
or index mechanisms, our approach significantly outperforms
other MapReduce-based SPARQL engines, as well as an ad-hoc
query engine equipped with various indexes and partitioning
tools, when evaluating multiple SPARQL queries.
1 INTRODUCTION
The Resource Description Framework (RDF) is a versatile graph
data model that enables users to express facts and their
relationships in the form of triples ⟨Subject, Predicate, Object⟩,
where a predicate (P) expresses a relationship between a subject
(S) and an object (O) [2]. Many knowledge bases are now defined
in the RDF format, e.g., DBpedia [16], Bio2RDF [6], and UniProt [7].
They form very large sets of graphs with millions of interlinked
triples and are queried with the SPARQL query language [1]. It has
been a major challenge to find the subgraph patterns described in
given SPARQL queries in a massive set of RDF graphs while
supporting both efficiency and scalability.
To address the issue, numerous SPARQL query engines have
been devised based on MapReduce [15] or on their own proprietary
distributed architectures [3]. Meanwhile, multiple SPARQL queries
often need to be evaluated together, as the queries can be
prepared before runtime in some scenarios [13, 20]. Motivated by
these facts and our previous work on XML data [8], we devise
a MapReduce-based SPARQL engine called SAMUEL, which
simultaneously evaluates many SPARQL queries in a shared and
balanced way. The major features of SAMUEL are as follows:
Support of parallel multi-SPARQL query processing
SAMUEL provides an efficient means to process a massive set of
RDF data in parallel. It does not require any sophisticated
partitioning or index mechanisms for SPARQL query evaluation.
Nonetheless, SAMUEL easily outperforms other MapReduce-based
distributed SPARQL engines by simultaneously evaluating
multiple SPARQL queries with a short series of MapReduce jobs.
∗ This work is equally supported by KISTI and NRF grant (No. NRF-2015R1A2A2A01004879)
funded by the Korea government.
2 Corresponding author
© 2018 Copyright held by the owner/author(s). Published in Proceedings of the 21st
International Conference on Extending Database Technology (EDBT), March 26-29,
2018, ISBN 978-3-89318-078-3 on OpenProceedings.org.
Distribution of this paper is permitted under the terms of the Creative Commons
license CC-by-nc-nd 4.0.
Sharing input scans and intermediate results
SAMUEL enables computing nodes to share input scans and their
intermediate results with each other. While RDF triples are joined
to evaluate queries, the group of join operations assigned to
each reducer shares the input and intermediate results associated
with the distinct subquery patterns that multiple queries have in
common. Consequently, it saves many I/Os by eliminating
redundant subquery matching while evaluating queries.
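The sharing idea described above can be illustrated with a minimal single-machine sketch. It is not SAMUEL's implementation: the function names (canonical, matches, shared_scan), the toy queries, and the toy triples are all hypothetical, and the MapReduce distribution is left out. The sketch only shows how triple patterns that differ in variable names collapse into one distinct pattern, so that one pass over the input serves every query that contains it.

```python
from collections import defaultdict

# A triple pattern is an (S, P, O) tuple; terms starting with "?" are variables.
def canonical(pattern):
    """Rename variables by order of appearance so that patterns differing
    only in variable names compare equal."""
    names = {}
    out = []
    for term in pattern:
        if term.startswith("?"):
            out.append(names.setdefault(term, f"?v{len(names)}"))
        else:
            out.append(term)
    return tuple(out)

def matches(pattern, triple):
    """Return True if the triple satisfies the pattern."""
    bound = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # A variable repeated within a pattern must bind to one value.
            if bound.setdefault(p, t) != t:
                return False
        elif p != t:
            return False
    return True

def shared_scan(queries, triples):
    """Scan the triples once, evaluating each distinct pattern a single time."""
    users = defaultdict(set)  # distinct pattern -> ids of queries containing it
    for qid, patterns in queries.items():
        for pat in patterns:
            users[canonical(pat)].add(qid)
    results = {pat: [] for pat in users}
    for triple in triples:  # single pass over the input
        for pat in results:
            if matches(pat, triple):
                results[pat].append(triple)
    return users, results

# Hypothetical toy workload: both queries contain a "knows" pattern.
queries = {
    "q1": [("?x", "type", "Person"), ("?x", "knows", "?y")],
    "q2": [("?a", "knows", "?b"), ("?a", "worksAt", "?c")],
}
triples = [
    ("alice", "type", "Person"),
    ("alice", "knows", "bob"),
    ("bob", "worksAt", "acme"),
]
users, results = shared_scan(queries, triples)
```

Here the four patterns of the two queries reduce to three distinct ones, and the matches of the shared "knows" pattern are computed once and made available to both q1 and q2.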
Runtime load balancing and multi-query optimization
In a distributed system, a straggling task delays overall job
execution. This problem worsens with MapReduce, since
MapReduce typically enforces barrier synchronization between
Map and Reduce tasks. MapReduce’s native runtime scheduling
algorithm has also proved to be inefficient, especially in the reduce
stage [12, 15]. To address the issue, SAMUEL instead exploits a
dynamic shuffling scheme that balances workloads across reducers
at each MapReduce job in accordance with the cardinality of the RDF
triples that each reducer consumes for joining. It decomposes a
given set of SPARQL queries into distinct triple patterns and then
gradually builds a bushy query plan tree that covers all RDF join
operations required to evaluate the given SPARQL queries. Since
processing the join operations at each level of the query plan tree
requires a single MapReduce job, SAMUEL always tries to build a
bushy plan tree with the lowest possible height. It then partitions
the join operations at each level into n groups and assigns them
to n reducers. To balance workloads across the reducers that actually
perform join operations, SAMUEL computes the cost of each join
operation at each level of the plan tree before the actual joining. It
then assigns join operations to reducers at each level such that
every reducer has the same overall cost of join operations and the
lowest communication cost, avoiding as far as possible the transfer
of redundant RDF triples to multiple reducers.
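The assignment of join operations to reducers can be sketched as a cost-based partitioning problem. The example below is only an illustration under stated assumptions: it uses the classic greedy longest-processing-time heuristic, which is not claimed to be SAMUEL's algorithm, it treats the cost of a join as a single precomputed number (for instance, an estimated input cardinality), and it does not model communication cost. The function name assign_joins and the join costs are hypothetical.

```python
import heapq

def assign_joins(join_costs, n_reducers):
    """Greedy heuristic: hand each join, heaviest first, to the reducer
    that currently has the smallest total cost."""
    heap = [(0, r) for r in range(n_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {r: [] for r in range(n_reducers)}
    # Sort by descending cost; break ties by name for a deterministic result.
    for join, cost in sorted(join_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, r = heapq.heappop(heap)
        assignment[r].append(join)
        heapq.heappush(heap, (load + cost, r))
    loads = {r: sum(join_costs[j] for j in joins) for r, joins in assignment.items()}
    return assignment, loads

# Hypothetical costs for the joins at one level of a plan tree.
join_costs = {"j1": 80, "j2": 70, "j3": 60, "j4": 40, "j5": 30, "j6": 20}
assignment, loads = assign_joins(join_costs, 3)
```

With these costs the heuristic happens to give every reducer a total cost of 100. In general it only approximates an even split, and a fuller treatment would also weigh which joins consume the same triples, so that those triples are shipped to as few reducers as possible.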
The rest of this paper is organized as follows. Section 2
introduces previous studies directly related to our work. Section 3
describes how we evaluate multiple SPARQL queries together
at once and presents our system architecture, which provides
workload balancing as well as multi-SPARQL query processing.
Section 4 presents our demonstration scenario, including the major
results of performance evaluations.
2 RELATED WORK
Numerous distributed SPARQL engines have so far been devised to
store RDF data and to evaluate SPARQL queries [3, 9,
11, 14, 19, 21, 22]. Readers are referred to a recent survey on
distributed SPARQL query engines [3]. These systems fall into
Demonstration
Series ISSN: 2367-2005 666 10.5441/002/edbt.2018.80